ABSTRACT


Boundary  Kernel  Estimation 

of  the  Two  Sample  Comparison  Density  Function.  (May  1989) 

William  Pyle  Alexander,  B.S.A.,  University  of  Arkansas 
Chair  of  Advisory  Committee:  Dr.  Emanuel  Parzen 

The focus of this work is to derive functional and graphical statistical techniques
for the two sample problem suitable for implementation in modern computing
environments. In the two sample problem, it is desired to test the null
hypothesis that two independent random samples have a common distribution
function. Assuming certain conditions on the distribution functions, a procedure
is proposed which has strong graphical elements, a sound theoretical foundation,
and estimates the relation of the two distributions if the null hypothesis
is rejected. The proposed procedure has as its motivation the estimation of the
comparison density and inference concerning its uniformity.

The  proposed  procedure  is  both  a  statistical  test  of  the  null  hypothesis  and 
a  model  selection  criterion.  The  test  is  based  on  components  of  a  new  stochastic 
process  which  is  termed  the  kernel  density  process.  This  process  is  based  on  a 
boundary  kernel  estimate  of  the  comparison  density.  It  is  proposed  to  apply  a 
new  test,  the  subset  chi-square  test,  to  these  components.  If  the  null  hypothesis  is 
rejected,  the  components  found  to  be  significant  are  used  to  construct  a  damped 
orthogonal  series  estimate  of  the  comparison  density. 

The power of the proposed test under local alternatives is compared with that of two
commonly used portmanteau statistics, the Cramér-von Mises and the
Anderson-Darling, and with that of a third statistic suggested by this work. A new method for
finding the power of these statistics under local alternatives is given. This method
uses the fast Fourier transform to invert an approximation to the characteristic
function of the statistic. The proposed test is seen to have good power properties.
A simulation study is conducted to examine its size in small samples. Its size is found
to remain close to its nominal value.
1. INTRODUCTION


1.1.  The  Two  Sample  Problem 

The statistical analysis of two samples occupies a fundamental role in statistics.
Data are collected under two regimes. Treatment and control, two time
periods, two lots of goods, two levels of a concomitant variable, or two formulations
of a product are just a few examples of such regimes. In each case, the
researcher wishes to know whether the two data sets arise from the same underlying
population. That is, are the two regimes the same? The number of
statistical texts at all levels which treat the problem attests to the fundamental
nature of the question: see, for example, Keller, Warrack and Bartel (1988)
(undergraduate level methods), Montgomery (1984) (graduate level methods),
Hocking (1985) (graduate level linear model theory), Randles and Wolfe (1979)
(graduate level nonparametric theory), and Kendall and Stuart (1979) (graduate
level parametric theory).

That  the  two  sample  problem  has  old  and  venerable  roots  can  be  seen  from 
the  writings  of  the  great  statistician  Sir  R.A.  Fisher.  Fisher  (1948),  page  122, 
contrasts  the  importance  of  testing  a  single  mean  versus  the  equality  of  two  means 
in  experimental  work  under  the  assumption  of  normality  and  equal  variances: 
“in experimental work it is even more frequently necessary to test whether two
samples differ significantly in their means, or whether they may be regarded as
having arisen from the same population.” Fisher’s comments relate not only the
relative importance of the two sample problem, but also something of its age.
The statistic Fisher proceeds to discuss is the well known Student’s t. Concerning
the distribution of t, Fisher notes on page 16 that “it is equally fortunate that
the distribution of t, first established by ‘Student’ in 1908, in his study of the
probable error of the mean, should be applicable, not only to the case there
treated, but to the more complex, but even more frequently needed, problem of
the comparison of two mean values.”

The  format  and  style  follows  that  of  The  Annals  of  Statistics. 
While dealing with the two sample problem in various contexts over the
years, statisticians have proposed an impressive number of tests. The t-test,
Wilcoxon, median, normal scores, Kolmogorov-Smirnov, Anderson-Darling, and
Cramér-von Mises are just a few of many. These tests range from making quite
specific assumptions on the nature of the two distributions to almost none. The
sorts of assumptions commonly made are discussed in Section 2.

Formally, the two sample problem can be stated in the following terms. Let
$X = (X_1, \ldots, X_m)$ be a random sample from a population with distribution
function $F(x) = P[X_i \le x]$ for $i = 1, \ldots, m$. Let $Y = (Y_1, \ldots, Y_n)$ be an independent
random sample from a population with distribution function $G(x) = P[Y_j \le x]$
for $j = 1, \ldots, n$. Stating that $X_1, \ldots, X_m$ is a random sample from $F$ means that
$X_1, \ldots, X_m$ are independent and identically distributed (iid) according to the
distribution function $F$. These properties are nearly universally assumed. Some
work has been done to relax the assumption of independence within each sample
[see, for example, Harpaz (1985)]. However, in this work the usual assumption
of independence will be made. The mathematical representation of the null
hypothesis that the two samples arise from the same population is $H_0: F = G$. By
this is meant that $F(x) = G(x)$ for $-\infty < x < \infty$.

Since  there  already  exists  such  a  plethora  of  two  sample  tests,  it  is  only 
natural  to  ask  why  another  is  needed.  The  answer  to  this  question  is  threefold. 
First,  it  is  the  goal  of  this  research  to  derive  a  unified  data  exploratory  method. 
That  is,  a  methodology  is  sought  which  makes  minimal  assumptions  on  F  and 
G  a  priori  and  will  to  the  greatest  extent  possible  let  the  data  determine  the 
outcome. 

Second, as typically implemented, two sample techniques possess no graphical
elements. They are statistical tests which return only an accept or reject
response. The advent of the personal computer and workstations brings the
potential to substantially alter the way in which statistical analysis is conducted. In
particular, there is great potential for a more graphical, exploratory and flexible
approach to data analysis. Packages such as Timeslab [Newton (1988)] for the
IBM personal computer family and S [Becker, Chambers, and Wilks (1988)] for
UNIX workstations are appearing to realize this potential. These new computing
environments open broad new areas of statistical research into methodologies to
exploit them. This research shall develop tests and estimators for the two sample
problem which are better suited to these environments than existing ones.

Third,  the  relevant  standard  two  sample  statistics  give  no  indication  of  the 
actual  relation  of  F  to  G  should  the  null  hypothesis  be  rejected.  This  shall  be 
seen  in  Section  2.  The  most  natural  question  to  ask  after  the  null  hypothesis  is 
rejected is “Why was it rejected?” Yet surprisingly, most techniques are silent
on this point. Graphics enter here as the natural response. A graph should be an
estimate of some sort of the relation of F to G. In this context, many practicing
statisticians might plot the empirical distribution functions of the two samples.
This and similar procedures are, at best, ad hoc. A particular statistic rejects H0
and  one  then  proceeds  to  examine  a  picture  of  functions  which  are  either  step 
functions  or  piece-wise  continuous  to  try  to  discern  why.  What  is  sought  here 
is  a  unified  approach  where  the  graph  and  the  test  are  derived  from  a  common 
foundation. 

In  summary,  the  purpose  of  this  research  is  to  derive  a  new  procedure  for 
the  two  sample  problem.  This  is  to  be  a  computer  intensive,  graphical  and  data- 
exploratory  procedure  which  makes  minimal  assumptions  on  the  character  of  the 
distribution  functions,  F  and  G.  Further,  as  part  of  the  framework,  should  the 
null  hypothesis  be  rejected,  it  is  required  that  some  explanation  be  given.  This 
explanation,  in  the  form  of  a  graph,  should  describe  the  relation  of  F  and  G. 

1.2.  Outline  of  This  Dissertation 

This dissertation is divided into six sections and two appendices. Section
2 is a review of the literature. Existing approaches to the two sample problem
are examined first. Such approaches include parametric and nonparametric tests
and tests against specific and general alternatives. Linear rank statistics and the
Cramér-von Mises and Anderson-Darling statistics are examined in depth. The
comparison distribution and density functions and related stochastic processes
are defined and their properties enumerated. Existing techniques based on these
quantities are reviewed. It is seen that a nonparametric method is appropriate
for estimating the comparison density. This leads to an in-depth review of such
methods and the selection of the Gasser-Müller boundary kernel.

Section 3 examines the properties of the Gasser-Müller boundary kernel
estimator of the comparison density. Conditions for its asymptotic normality
under H0 and pointwise consistency under any alternative are derived when the
bandwidth shrinks to zero. The kernel density process is defined and its weak
convergence to a Gaussian process is proved for a fixed bandwidth. Components
of this process are defined in terms of the inner product of the process and the
eigenfunctions of its covariance kernel under H0. These components are shown to
be appropriate for testing the null hypothesis. They have interpretations both as
generalized Fourier coefficients and as rank statistics. Their asymptotic
distribution under H0 is derived. A new test, called the subset chi-square test, is applied
to the components. This test, in turn, suggests an orthogonal series estimator of
the comparison density based on the eigenfunctions. Finally, recommendations
for the choice of bandwidth for the boundary kernel estimator are made.

Section  4  examines  the  power  and  size  of  the  methods  derived  in  Section 
3.  Conditions  for  the  weak  convergence  of  the  kernel  density  process  under 
local alternatives are established. Power functions for the Cramér-von Mises and
Anderson-Darling  statistics  are  found  by  using  a  fast  Fourier  transform  (FFT) 
to  numerically  invert  the  characteristic  function.  Power  functions  for  the  subset 
chi-square  test  are  found  by  simulation.  The  methods  of  Section  3  are  seen  to 
have  very  good  power  characteristics.  Since  the  subset  chi-square  test  is  based 
on  the  asymptotic  distribution  of  the  components,  a  simulation  is  conducted  to 
gauge  the  size  of  the  test  in  small  samples.  The  procedure  is  found  to  maintain 
a  reasonable  size  even  for  small  samples. 

Section  5  applies  the  techniques  of  Section  3  to  two  data  sets.  One  data  set 
consists  of  observed  data  and  the  other  of  simulated  data.  Section  6  presents  a 
summary  and  conclusions  and  outlines  areas  of  future  research.  Appendix  A  is 
a  glossary  of  notation  and  Appendix  B  gives  proofs  of  the  theorems  stated  in 
Sections  3  and  4. 
2. REVIEW OF THE LITERATURE


2.1.  Introduction 

Section  2  is  a  review  of  the  existing  methodologies  which  this  research  touches 
or  builds  upon.  Subsection  2.2  reviews  the  methods  of  two  sample  analysis  as 
usually  employed  by  the  statistics  community.  The  subsection  closes  with  a 
statement  as  to  the  criteria  any  methodology  derived  in  this  work  must  meet 
and  the  rationale  for  these  criteria. 

Subsection  2.3  reviews  a  concept  known  as  the  comparison  density.  It  is 
argued  that  a  methodology  based  on  the  comparison  density  fulfills  the  criteria 
outlined  in  Subsection  2.2.  The  stochastic  processes  upon  which  any  technique 
must  be  based  are  reviewed  and  links  between  them  refined.  Existing  techniques 
of estimating the comparison density are reviewed. Each of these will be compared
to the criteria outlined in Subsection 2.2. They will be seen to fall short
of  fulfilling  all  the  criteria  outlined,  but  will  prove  to  be  valuable  stepping  stones 
in  this  work. 

Subsection 2.4 reviews nonparametric density estimation techniques for densities
having support [0,1]. Such a technique will be employed to estimate the
comparison  density  in  this  work.  Density  estimates  on  compact  support  pose 
special  problems  which  will  be  discussed  in  detail.  At  the  end  of  the  subsection, 
an  estimation  technique  is  selected. 

2.2.  Review  of  Two  Sample  Techniques 

2.2.1. Introduction. Given a random sample, $X_1, \ldots, X_m$, from a continuous
distribution function, $F$, and an independent random sample, $Y_1, \ldots, Y_n$, from
a continuous distribution function, $G$, it is desired to test the null hypothesis
$H_0: F = G$. In this subsection, existing tests of this hypothesis are reviewed.
There are many tests that have been suggested and used over the years. Some are
applicable to very special and specific distributional assumptions concerning the
two samples. Others are applicable to very general cases. Yet, as implemented,
none lend themselves fully to the type of computer intensive, graphical, functional
and data exploratory approach outlined in Section 1.

There  are  two  schemes  by  which  two  sample  tests  can  be  classified  which  will 
be discussed here. The first method classifies a test as to whether it is parametric
or nonparametric. The second classifies a test as to whether the alternative
hypothesis is general or specific. These are discussed in Subsections 2.2.2 and 2.2.3,
respectively.  Subsection  2.2.4  reviews  rank  statistics,  which  are  nonparametric 
tests  against  specific  alternatives.  Subsection  2.2.5  reviews  nonparametric  tests 
against  general  alternatives.  Having  reviewed  the  commonly  used  methodologies, 
a  list  of  criteria  for  the  ideal  method  is  proposed  in  Subsection  2.2.6.  This  list 
summarizes  the  desirable  properties  a  methodology  should  have  to  more  fully 
utilize  computer  and  graphic  intensive  modes  for  data  analysis. 

2.2.2.  Parametric  and  Nonparametric  Tests.  Tests  can  be  classified  as  to 
whether  they  are  parametric  or  nonparametric.  In  this  subsection,  parametric 
and  nonparametric  tests  are  compared.  It  is  seen  that  parametric  tests  are  too 
restrictive  for  the  goals  of  this  research.  Any  tests  derived  will  be  nonparametric 
in  nature. 

Parametric tests assume that both $F$ and $G$ belong to a family of distributions,
$\mathcal{F}_\theta$, which is indexed by a parameter $\theta$. That is, one can write

$$\mathcal{F}_\theta = \{F(x) = F(x;\theta_1);\ G(x) = G(x;\theta_2) : \theta_1, \theta_2 \in \Theta \subset \mathbb{R}^p\}.$$

It is usually assumed that $\mathcal{F}_\theta$ is indexed in such a manner that for $F(x;\theta_1)$ and
$G(x;\theta_2)$ in $\mathcal{F}_\theta$, one has $F(x;\theta_1) = G(x;\theta_2)$ for all $x$ if and only if $\theta_1 = \theta_2$. This
uniqueness property permits the reduction of the general two sample hypothesis
of $H_0: F = G$ to $H_0: \theta_1 = \theta_2$.

With the two sample hypothesis reduced to testing the equality of two vector
valued parameters, standard techniques from parametric inference may be
brought  to  bear  upon  the  problem.  Tests  such  as  the  likelihood  ratio,  efficient 
score,  Wald,  and  uniformly  most  powerful  unbiased  may  potentially  be  derived. 
See  Silvey  (1975),  Kendall  and  Stuart  (1979)  or  Bickel  and  Doksum  (1977)  for 
background  on  these. 
As an example of a parametric test, consider the two sample t-test. Let
$\theta = (\mu, \sigma^2)$ and $\Phi(x)$ be the standard normal distribution function. Define the
parametric family as

$$\mathcal{F}_\theta = \{F(x) = \Phi((x - \mu_1)/\sigma);\ G(x) = \Phi((x - \mu_2)/\sigma) : \mu_1, \mu_2 \in \mathbb{R};\ \sigma^2 > 0\}.$$

The null hypothesis is $H_0: \mu_1 = \mu_2$. A likelihood ratio test rejects $H_0$ if

$$\Lambda = \sup_{\mu_1 = \mu_2 \in \mathbb{R},\ \sigma^2 > 0} L(\mu_1, \mu_2, \sigma^2) \Big/ \sup_{\mu_1, \mu_2 \in \mathbb{R},\ \sigma^2 > 0} L(\mu_1, \mu_2, \sigma^2)$$

is too small, where $L(\mu_1, \mu_2, \sigma^2)$ is the joint likelihood function of $(\mu_1, \mu_2, \sigma^2)$
given $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ and is given by

$$L(\mu_1, \mu_2, \sigma^2) = (2\pi\sigma^2)^{-(n+m)/2} \exp\Big[-\frac{1}{2\sigma^2}\Big(\sum_{i=1}^{m} (x_i - \mu_1)^2 + \sum_{j=1}^{n} (y_j - \mu_2)^2\Big)\Big].$$

The likelihood function gives an instantaneous measure of the probability content
at the point $(\mu_1, \mu_2, \sigma^2)$ given the data $x_1, \ldots, x_m$, $y_1, \ldots, y_n$. In the case at
hand, it is not hard to show that rejecting $H_0$ for small values of $\Lambda$ is equivalent
to rejecting $H_0$ for large values of

$$|T| = \frac{|\bar{X} - \bar{Y}|}{S\sqrt{1/m + 1/n}},$$

where $S^2 = \frac{1}{n+m-2}\big(\sum_{i=1}^{m} (X_i - \bar{X})^2 + \sum_{j=1}^{n} (Y_j - \bar{Y})^2\big)$ is the sample pooled
variance and $\bar{X}$ and $\bar{Y}$ are the means of the first and second samples, respectively.
This statistic is the standard two sample t-statistic.
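As an illustration of the computation just described, the following is a minimal Python sketch of the pooled-variance two sample t-statistic. The function name `pooled_t_statistic` and the two small data sets are hypothetical and serve only to make the arithmetic concrete; the signed statistic is returned, so the likelihood ratio test corresponds to rejecting for large absolute values.

```python
import math


def pooled_t_statistic(x, y):
    """Signed two sample t-statistic using the pooled sample variance."""
    m, n = len(x), len(y)
    xbar = sum(x) / m
    ybar = sum(y) / n
    # Sums of squared deviations about each sample mean.
    ss_x = sum((xi - xbar) ** 2 for xi in x)
    ss_y = sum((yj - ybar) ** 2 for yj in y)
    # Pooled variance with n + m - 2 degrees of freedom.
    s2 = (ss_x + ss_y) / (m + n - 2)
    return (xbar - ybar) / math.sqrt(s2 * (1.0 / m + 1.0 / n))


# Hypothetical data: the second sample is the first shifted by one unit.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 3.0, 4.0, 5.0, 6.0]
t = pooled_t_statistic(x, y)
print(t)
```

For these illustrative data the sample means are 3 and 4 and the pooled variance is 2.5, so the statistic equals -1.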

Most  parametric  tests  are  generated  in  an  analogous  manner.  One  starts  by 
writing  down  the  likelihood  function.  The  likelihood  ratio,  Wald  and  efficient 
scores  tests  are  constructed  around  the  maximum  likelihood  estimates  of  the 
parameters.  Uniformly  most  powerful  unbiased  tests  can  be  obtained  in  special 
cases  and  require  an  appeal  to  certain  theorems  detailing  their  existence  and 
construction. 

Nonparametric tests assume that $F$ and $G$ lie in a class, $\mathcal{F}$, which is so
broad that it cannot be indexed by a finite dimensional parameter. An example
of such a class is the class of all continuous distribution functions. Nonparametric
techniques rely heavily on transformations of the original random variables,
$X_1, \ldots, X_m$, $Y_1, \ldots, Y_n$, to new random variables, $R_1, \ldots, R_m$, $S_1, \ldots, S_n$, such
that when the null hypothesis is true these new random variables always have the
same, known distribution. A set of random variables possessing this property is
said to be nonparametric distribution-free under the null hypothesis. Tests of $H_0$
are then based on $R_1, \ldots, R_m$, $S_1, \ldots, S_n$ rather than on $X_1, \ldots, X_m$, $Y_1, \ldots, Y_n$.

An example of such a transformation is the rank transformation. For the rank
transformation, $R_i$ is the rank of $X_i$ and $S_j$ is the rank of $Y_j$ in the pooled
or combined sample. When one pools the sample, $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$
are treated as being from one random sample, $X_1, \ldots, X_m, Y_1, \ldots, Y_n$, of size
$N = m + n$. The rank, $R_i$, is given by

$$R_i = \sum_{j=1}^{m} I(X_j \le X_i) + \sum_{j=1}^{n} I(Y_j \le X_i),$$

where $I(\cdot)$ is an indicator function which is 1 if the condition in parentheses
is true and zero otherwise. There should be no ties in the ranks since
it is assumed that $F$ and $G$ are continuous. Under the null hypothesis,
$R_1, \ldots, R_m, S_1, \ldots, S_n$ are uniformly distributed over all $N!$ permutations of
$1, \ldots, N$ and $R_i$ is marginally distributed as uniform over $1, \ldots, N$ [see Lehmann
(1975), page 58, for example].
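The rank transformation can be sketched directly from the indicator-sum definition: the rank of an observation is the number of pooled observations that do not exceed it. The Python sketch below assumes no ties, as in the continuous case; the function name `pooled_ranks` and the data are hypothetical.

```python
def pooled_ranks(x, y):
    """Ranks of each X_i and each Y_j within the pooled sample (no ties assumed)."""
    pooled = list(x) + list(y)
    # Rank = number of pooled observations less than or equal to the value.
    r = [sum(1 for v in pooled if v <= xi) for xi in x]
    s = [sum(1 for v in pooled if v <= yj) for yj in y]
    return r, s


# Hypothetical samples with m = 3 and n = 2, so N = 5.
x = [2.1, 0.4, 3.3]
y = [1.0, 2.7]
r, s = pooled_ranks(x, y)
print(r, s)
```

The pooled order here is 0.4, 1.0, 2.1, 2.7, 3.3, so the first sample receives ranks 3, 1, 5 and the second sample ranks 2, 4; together they form a permutation of 1, ..., 5.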

The rank transform and tests based upon it are examined in detail in Subsection
2.2.4. For now, consider the Wilcoxon statistic which is given by

$$W = \sum_{j=1}^{m} R_j.$$

As the ranks are uniformly distributed over the $N!$ permutations under $H_0$, it
can be shown [see Randles and Wolfe (1979), page 45] that $E[W] = m(N + 1)/2$
and $\mathrm{Var}[W] = nm(N + 1)/12$. The full distribution of $W$ under $H_0$ can be found
by enumeration or by asymptotic approximation.
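The enumeration route can be sketched as follows. Under the null hypothesis, uniformity over permutations implies that the set of ranks assigned to the first sample is equally likely to be any subset of size m of 1, ..., N, so the exact distribution of the rank sum is obtained by listing those subsets. The function name `wilcoxon_null_distribution` and the choice m = n = 5 are illustrative assumptions; exact rational arithmetic is used so the moment formulas can be checked without rounding.

```python
from fractions import Fraction
from itertools import combinations


def wilcoxon_null_distribution(m, n):
    """Exact null distribution of the rank sum W as a dict {w: probability}."""
    N = m + n
    counts = {}
    # Each m-subset of {1, ..., N} is an equally likely set of first-sample ranks.
    for ranks in combinations(range(1, N + 1), m):
        w = sum(ranks)
        counts[w] = counts.get(w, 0) + 1
    total = sum(counts.values())
    return {w: Fraction(c, total) for w, c in sorted(counts.items())}


m, n = 5, 5
N = m + n
dist = wilcoxon_null_distribution(m, n)
mean = sum(w * p for w, p in dist.items())
var = sum((w - mean) ** 2 * p for w, p in dist.items())
upper_tail = sum(p for w, p in dist.items() if w >= 36)
print(mean, var, upper_tail)
```

For m = n = 5 the enumerated mean and variance agree with m(N + 1)/2 = 55/2 and nm(N + 1)/12 = 275/12, and the upper tail probability of W at 36 or above is 12/252, the closest attainable level not exceeding 0.05.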

There  are  three  issues  to  be  addressed  in  any  discussion  of  parametric  versus 
nonparametric tests. These are specification error, size, and power. A specification
error occurs whenever $F$ or $G$ does not fall in the assumed parametric class,
$\mathcal{F}_\theta$. The size of a test is the probability of rejecting the null hypothesis when it
is true. The power of a test is the probability of rejecting the null hypothesis
when it is false. The tradition in statistics since the time of Fisher and Neyman
has been to fix the size at some small value. For a given size, one then prefers a
test which is more powerful. In most situations, there won’t exist a test which is
uniformly most powerful (UMP), that is, at least as powerful as any other test
of the same size against every alternative. There usually aren’t even
what are called uniformly most powerful unbiased tests (UMPU). Such a test,
when it exists, is most powerful in the class of unbiased tests. A test is unbiased
if its power is no smaller than its size for all alternatives.

In the current context, these issues will be discussed via Table 1. Table 1
gives the size and power of the two sample t-test and the Wilcoxon test for 4
choices of $m$ and $n$ and 5 choices of $(\mu_1 - \mu_2)/\sigma$ and for $\mathcal{F}_\theta$ equal to the normal
distribution family and to the Cauchy distribution family. This table is a partial
reproduction of a table found on page 118 of Randles and Wolfe (1979). The
size of the test falls under the column value of 0 for $(\mu_1 - \mu_2)/\sigma$. Each test is
conducted to have nominal size 0.05. The power of the test is given under the
remaining columns, for each choice of family, in increasing values of $(\mu_1 - \mu_2)/\sigma$.
This table was created by simulation methods; see Randles and Wolfe for further
details on its construction.
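A simulation of this general kind can be sketched in Python as below. This is not the construction used by Randles and Wolfe; it is a minimal sketch under stated assumptions: the normal family only, m = n = 5, one-sided tests that reject for large values, the tabled critical value 1.860 for Student's t with 8 degrees of freedom, and a non-randomized Wilcoxon test whose critical value is found by enumeration. All function names, the seed, and the number of replications are hypothetical choices.

```python
import math
import random
from itertools import combinations


def pooled_t(x, y):
    """Signed two sample t-statistic with pooled variance."""
    m, n = len(x), len(y)
    xbar, ybar = sum(x) / m, sum(y) / n
    ss = sum((v - xbar) ** 2 for v in x) + sum((v - ybar) ** 2 for v in y)
    s2 = ss / (m + n - 2)
    return (xbar - ybar) / math.sqrt(s2 * (1.0 / m + 1.0 / n))


def wilcoxon_w(x, y):
    """Sum of the pooled-sample ranks of the first sample."""
    pooled = x + y
    return sum(sum(1 for v in pooled if v <= xi) for xi in x)


def wilcoxon_upper_critical(m, n, alpha):
    """Smallest c with P(W >= c) <= alpha under H0, found by enumeration."""
    sums = [sum(c) for c in combinations(range(1, m + n + 1), m)]
    c = max(sums)
    # Lower c while the enlarged upper tail still has probability <= alpha.
    while sum(1 for s in sums if s >= c - 1) <= alpha * len(sums):
        c -= 1
    return c


def rejection_rates(shift, reps, rng, m=5, n=5):
    """Estimated rejection rates (t-test, Wilcoxon) under a normal location shift."""
    t_crit = 1.860  # assumed: upper 5% point of t with 8 d.f.; valid for m = n = 5 only
    w_crit = wilcoxon_upper_critical(m, n, 0.05)
    t_rej = w_rej = 0
    for _ in range(reps):
        x = [rng.gauss(shift, 1.0) for _ in range(m)]
        y = [rng.gauss(0.0, 1.0) for _ in range(n)]
        t_rej += pooled_t(x, y) > t_crit
        w_rej += wilcoxon_w(x, y) >= w_crit
    return t_rej / reps, w_rej / reps


rng = random.Random(1989)  # arbitrary seed, used only for repeatability
size_t, size_w = rejection_rates(0.0, 20000, rng)
power_t, power_w = rejection_rates(1.2, 20000, rng)
print(size_t, size_w, power_t, power_w)
```

The rejection rates at shift 0 estimate the size of each test and those at shift 1.2 estimate the power. Because the Wilcoxon test is not randomized here, its true size is slightly below the nominal 0.05.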

If  a  parametric  assumption  is  valid,  one  expects  the  appropriate  parametric 
test  to  be  at  least  as  powerful  as  a  nonparametric  test.  This  is  so  simply  because 
one  is  bringing  more  information  to  bear  on  the  problem.  The  nonparametric 
test  must  protect  against  a  huge  array  of  possible  underlying  F  and  G  which  the 
parametric  test  ignores.  This  is  borne  out  by  Table  1  where  the  t-test  is  seen  to 
be  more  powerful  than  the  Wilcoxon  for  the  normal  family.  However,  it  is  also 
important  to  notice  that  this  difference  is  slight. 

The t-test will experience specification error when the Cauchy family holds
whereas the Wilcoxon will not. The implications of the t-test experiencing a
specification error are demonstrated under the Cauchy heading of Table 1. The
size of the t-test shows large fluctuations away from 0.05. The Wilcoxon test
shows no such failing. Finally, notice that the Wilcoxon test is now much more
powerful than the t-test. The superiority of the Wilcoxon over the t-test is much
greater in this case than that of the t-test over the Wilcoxon when normality
holds.

Table 1

Empirical power and size of the t-test and the Wilcoxon linear rank test
for the normal and Cauchy families. T is the t-test and W is the Wilcoxon
test.

                                 $(\mu_1 - \mu_2)/\sigma$
Family    m    n    Test    0        0.3      0.6      0.9      1.2
Normal    5    5    T       0.044    0.111    0.213    0.356    0.523
                    W       0.046    0.108    0.208    0.346    0.503
          15   15   T       0.052    0.206    0.497    0.785    0.947
                    W       0.054    0.205    0.479    0.766    0.933
          5    15   T       0.047    0.144    0.303    0.511    0.724
                    W       0.048    0.141    0.287    0.492    0.694
          15   5    T       0.053    0.149    0.313    0.518    0.729
                    W       0.050    0.140    0.296    0.499    0.703
Cauchy    5    5    T       0.024    0.066    0.132    0.207    0.288
                    W       0.051    0.118    0.218    0.323    0.408
          15   15   T       0.030    0.079    0.153    0.243    0.333
                    W       0.046    0.210    0.484    0.700    0.839
          5    15   T       0.056    0.087    0.137    0.205    0.282
                    W       0.046    0.133    0.284    0.441    0.576
          15   5    T       0.061    0.097    0.146    0.209    0.279
                    W       0.046    0.140    0.297    0.457    0.590

In  summary,  one  can  surmise  that  in  finite  samples,  parametric  tests  will 
often  outperform  nonparametric  tests  in  situations  where  the  parametric  test 
is  appropriate.  In  such  circumstances,  should  one  have  just  cause  to  assume 
a  parametric  model  one  should  surely  do  so.  However,  parametric  tests  are 
sensitive  to  specification  errors  which  nonparametric  tests  do  not  experience. 
As  it  is  the  goal  of  this  work  to  develop  a  procedure  which  applies  to  as  broad 
an underlying class of distributions as possible, parametric tests will not be
considered. However, the methods derived will be competitive with parametric
tests.  Nonparametric  tests  are  discussed  at  greater  length  in  Subsections  2.2.4 
and  2.2.5. 

2.2.3.  General  and  Specific  Alternative  Hypotheses.  In  this  subsection,  types 
of  alternative  hypotheses  are  examined.  The  alternative  hypothesis  specifies  the 
set of possible relations of F to G should the null hypothesis not be true.
Although ignored until now, the alternative hypothesis must always be stated. The
class  of  alternatives  considered  will  have  a  profound  effect  upon  the  properties 
of  a  test.  It  is  seen  in  this  subsection  that  for  the  type  of  procedure  to  be 
constructed,  a  general  and  not  specific  alternative  is  needed. 

For two sample tests, the alternative hypothesis can be divided into roughly
two categories. The first is a general alternative and the second a specific
alternative. A general alternative leaves the fashion in which $F$ and $G$ are related
unspecified. A specific alternative will place some structure on the manner in
which $F$ and $G$ are allowed to differ. The type of alternative considered relates
back to the underlying class, $\mathcal{F}$, to which $F$ and $G$ belong. A few examples
should shed some light on this. Consider any parametric test so that $F(x;\theta_1)$,
$G(x;\theta_2) \in \mathcal{F}_\theta$. The null hypothesis of $H_0: \theta_1 = \theta_2$ is complemented by an
alternative of the form $H_a: \theta_1 \ne \theta_2$ or $H_a: \theta_1 > \theta_2$. The important point is that $F$ and
$G$ are still in $\mathcal{F}_\theta$ even under the alternative. The alternative is said to be specific.

Most nonparametric tests are constructed against specific alternatives, also.
The most common alternative is the location alternative. For the location
alternative, the class of distributions is defined by

$$\mathcal{F} = \{F(x) = H(x);\ G(x) = H(x - \theta) : H \text{ is a continuous d.f.},\ \theta \in \mathbb{R}\}.$$

The null hypothesis reduces to $H_0: \theta = 0$, yet the test must still be nonparametric
because $H$ is any continuous distribution function, a class too broad to be
indexed. The alternative hypothesis in this setting can be $H_a: \theta \ne 0$ or $H_a: \theta > 0$.
The alternative is specific because $F$ and $G$ are related through $H$ by $\theta$. A third
and final example of a specific alternative is the scale alternative. The family for
scale alternatives is

$$\mathcal{F} = \{F(x) = H(x);\ G(x) = H(x/\eta) : H \text{ is a continuous d.f.},\ \eta > 0\}.$$

Again, the class is still too broad to be indexed but does place structure on the
relationship of $F$ to $G$ under the alternative. Nonparametric tests of location
include the Wilcoxon, median and normal scores (location) rank tests.
Nonparametric tests of scale include the Mood and normal scores (scale) rank tests.

A  general  alternative  leaves  the  way  in  which  F  and  G  are  related  unspecified. 
A  typical  class  in  this  case  might  be 

$$\mathcal{F} = \{F \text{ is a continuous d.f.};\ G \text{ is a continuous d.f.}\}.$$

One can see that this is a much broader class than the location and scale
alternatives considered above. Minimal assumptions are made on the true relation of
$F$ to $G$ under $H_a$.

The class of alternatives is important for just the same reasons as discussed in
Subsection 2.2.2 on the relation of parametric to nonparametric tests. The issues
are power and specification. A test against a specific alternative will usually be
more powerful than a test against a general alternative if the alternative which
actually holds falls in the class considered by the former. On the other hand, a
test against a specific alternative can fail miserably if the true alternative falls
outside the class for which it is designed. For example, using the techniques discussed in Section
4,  it  can  be  shown  that  asymptotically  the  Wilcoxon  test  has  power  equal  to  its 
size  if  a  local  scale  alternative  holds  and  the  underlying  distribution  is  symmetric 
about  zero. 

As  with  parametric  tests,  if  one  has  justification  to  use  a  test  designed  against 
a  specific  alternative,  it  should  by  all  means  be  used.  However,  since  the  purpose 
of this work is to design a methodology which makes minimal assumptions on
F  and  G,  broader  alternatives  than  location  or  scale  must  be  considered.  Tests 
against  such  broader  alternatives  are  called  omnibus  or  portmanteau. 


2.2.4. Nonparametric Tests Against Specific Alternatives (Linear Rank
Statistics). In this subsection, linear rank statistics, which are nonparametric
tests against specific alternatives, are reviewed. It has been decided in Subsection
2.2.3 not to employ tests against specific alternatives such as rank tests. However,
rank tests will be seen to have a role to play and as such merit discussion. Rank
statistics have many equivalent or asymptotically equivalent representations. In
this subsection, the notation of Chernoff and Savage (1958) is used. They define
a rank statistic having the form

$$S_N = \int J_N[H_N(x)]\,dF_m(x) = \frac{1}{m}\sum_{i=1}^{m} J_N(R_i/N), \tag{2.2.1}$$

where $R_i$ is the rank of $X_i$ in the pooled sample, $J_N$ is known as a score function,
unqualified integrals are assumed to be taken over the real line, $F_m$ is the sample
or empirical distribution function of the first sample,

$$F_m(x) = \frac{1}{m}\sum_{j=1}^{m} I(X_j \le x),$$

and $H_N$ is the sample distribution function of the pooled sample,

$$H_N(x) = \frac{1}{N}\Big(\sum_{j=1}^{m} I(X_j \le x) + \sum_{j=1}^{n} I(Y_j \le x)\Big) = \lambda_{(N)}F_m(x) + (1-\lambda_{(N)})G_n(x).$$

The sample distribution function of the second sample is $G_n(x)$ and $\lambda_{(N)} =
m/(m+n) = m/N$ is the fraction of the pooled sample represented by the first
sample. The sample distribution function of the pooled sample estimates the
population quantity $H(x) = \lambda_{(N)}F(x) + (1-\lambda_{(N)})G(x)$.
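The rank form of (2.2.1) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `pooled_ranks` and `linear_rank_statistic` and the toy data are assumptions introduced here, and ties are assumed absent (which holds with probability one for continuous F and G).

```python
def pooled_ranks(x, y):
    """Ranks (1..N) of the first-sample values within the pooled sample.

    Assumes no ties, which holds with probability one for continuous F and G.
    """
    pooled = sorted(list(x) + list(y))
    position = {value: k + 1 for k, value in enumerate(pooled)}
    return [position[value] for value in x]


def linear_rank_statistic(x, y, score):
    """S_N = (1/m) * sum_i J(R_i / N) for a score function J on (0, 1]."""
    m, n_total = len(x), len(x) + len(y)
    return sum(score(r / n_total) for r in pooled_ranks(x, y)) / m


# Hypothetical toy samples, chosen only to make the ranks easy to read off.
x = [0.3, 1.7, 2.2]
y = [0.9, 1.1, 2.9, 3.4]
wilcoxon = linear_rank_statistic(x, y, lambda u: u)  # Wilcoxon scores J(u) = u
```

Passing a different `score` function (for example one of those in Table 2) gives the corresponding linear rank statistic without changing the rest of the computation.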

As might be expected, the small sample distribution of $S_N$ under H0 or Ha
is not always easy to determine. This is true even of the Wilcoxon statistic. The
finding of percentage points of the distribution of $S_N$ is greatly simplified by
the celebrated work of Chernoff and Savage (1958). Using what has come to be
called a Chernoff-Savage approach, they demonstrate the asymptotic normality
of $S_N$. For simplicity, assume that $J_N(u) = J(u)$ does not depend on N. One
then has

Theorem 2.2.1 (Chernoff and Savage, 1958). If $J(u)$ is not constant and if
$|J^{(i)}(u)| \le K[u(1-u)]^{-i-1/2+\delta}$ for i = 0, 1, 2 and some K and $\delta > 0$, then for
fixed and continuous F and G, one has $S_N$ is $AN(\mu, \sigma_N^2)$, where

$$\mu = \int J[H(x)]\,dF(x)$$

and

$$N\sigma_N^2 = 2(1-\lambda_{(N)})\Big\{\iint_{x<y} G(x)[1-G(y)]\,J'[H(x)]\,J'[H(y)]\,dF(x)\,dF(y) + \frac{1-\lambda_{(N)}}{\lambda_{(N)}}\iint_{x<y} F(x)[1-F(y)]\,J'[H(x)]\,J'[H(y)]\,dG(x)\,dG(y)\Big\},$$

providing $\sigma_N \ne 0$.

The notation $S_N$ is $AN(\mu, \sigma_N^2)$ means that the distribution function of the random
variable $(S_N - \mu)/\sigma_N$ converges pointwise to the distribution function of a
standard normal random variable. To find approximate values of the distribution
function of $S_N$, one need only calculate the values of $\mu$ and $\sigma_N$. In many
practical circumstances, the values of $\mu$ and $\sigma_N$ can be worked out. For example,
taking J(u) = u (Wilcoxon scores), under H0 one finds

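$$\mu = \int F(x)\,dF(x) = \int_0^1 u\,du = 1/2$$

and

$$\begin{aligned}
N\sigma_N^2 &= 2\,\frac{1-\lambda_{(N)}}{\lambda_{(N)}}\iint_{x<y} F(x)[1-F(y)]\,dF(x)\,dF(y) \\
&= 2\,\frac{1-\lambda_{(N)}}{\lambda_{(N)}}\int [1-F(y)]\int_{-\infty}^{y} F(x)\,dF(x)\,dF(y) \\
&= \frac{1-\lambda_{(N)}}{\lambda_{(N)}}\int F(y)^2[1-F(y)]\,dF(y) \\
&= \frac{1}{12}\,\frac{1-\lambda_{(N)}}{\lambda_{(N)}}.
\end{aligned}$$

By the Chernoff-Savage theorem, one can conclude that the Wilcoxon statistic
is $AN(1/2,\ (1-\lambda_{(N)})/(12N\lambda_{(N)}))$.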
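The asymptotic moments just derived can be compared with the exact null moments for small samples. The sketch below assumes that, under H0, every subset of m ranks out of {1, ..., N} is equally likely, and enumerates them; the function names and the choice m = 4, n = 5 are illustrative assumptions only.

```python
from fractions import Fraction
from itertools import combinations


def wilcoxon_null_moments(m, n):
    """Exact null mean and variance of S_N = (1/m) * sum_i R_i / N.

    Under H0 every m-subset of the ranks {1, ..., N} is equally likely,
    so the moments follow from enumerating all subsets.
    """
    N = m + n
    values = [Fraction(sum(ranks), m * N)
              for ranks in combinations(range(1, N + 1), m)]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance


def asymptotic_variance(m, n):
    """(1 - lambda) / (12 N lambda) with lambda = m / N."""
    N = m + n
    lam = Fraction(m, N)
    return (1 - lam) / (12 * N * lam)


mean, variance = wilcoxon_null_moments(4, 5)
ratio = variance / asymptotic_variance(4, 5)
```

For this choice the exact mean is (N + 1)/(2N) rather than 1/2, and the exact variance exceeds the asymptotic value by the factor (N + 1)/N; both discrepancies vanish as N grows.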


Table 2

Commonly used score functions for linear rank statistics, location and scale
alternatives.

Alternative   Name              Score Function
-----------   ---------------   -----------------
Location      Wilcoxon          u
Location      Normal            \Phi^{-1}(u)
Location      Median            I(u < 1/2)
Scale         Mood              (u - 1/2)^2
Scale         Normal            \Phi^{-1}(u)^2
Scale         Ansari-Bradley    |u - 1/2|

Many  different  forms  of  the  score  function  have  been  proposed,  some  which 
depend  on  N  and  some  which  do  not.  It  is  the  score  function  which  determines 
the  properties  of  the  statistic.  It  is  up  to  the  user  of  these  techniques  to  choose 
the  score  function.  Some  considerations  for  its  choice  are  now  given.  Table  2  gives 
some  commonly  used  score  functions  for  scale  and  location  alternatives.  Notice 
that  the  score  functions  corresponding  to  location  alternatives  are  monotone  and 
those  corresponding  to  scale  alternatives  have  one  sign  change  in  their  derivative. 
If one were to redefine the score function as $\tilde{J}(u) = J(u) - \mu$ (thus centering the
statistic), this observation can be recast in terms of zero crossings. The score
functions for location alternatives have one zero crossing in (0, 1) and those for
scale  alternatives  have  two.  Eubank,  LaRiccia  and  Rosenstein  (1987)  give  further 
intuition  into  this  matter.  For  now,  it  is  enough  to  notice  such  a  pattern. 
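The zero-crossing pattern can be checked numerically for the score functions of Table 2. The following is a rough sketch, not part of the original development: each score is centered by its average over a midpoint grid on (0, 1) (a numerical stand-in for the centering constant), and sign changes of the centered score are counted. The dictionary `SCORES`, the function name, and the grid size are assumptions made for illustration.

```python
from statistics import NormalDist

inv_phi = NormalDist().inv_cdf  # standard normal quantile function

SCORES = {
    "Wilcoxon (location)": lambda u: u,
    "Normal (location)": inv_phi,
    "Median (location)": lambda u: 1.0 if u < 0.5 else 0.0,
    "Mood (scale)": lambda u: (u - 0.5) ** 2,
    "Normal (scale)": lambda u: inv_phi(u) ** 2,
    "Ansari-Bradley (scale)": lambda u: abs(u - 0.5),
}


def zero_crossings(score, grid_size=1000):
    """Count sign changes of the centered score on a midpoint grid of (0, 1)."""
    grid = [(k + 0.5) / grid_size for k in range(grid_size)]
    values = [score(u) for u in grid]
    center = sum(values) / grid_size  # numerical centering constant
    signs = [v > center for v in values]
    return sum(a != b for a, b in zip(signs, signs[1:]))


crossings = {name: zero_crossings(J) for name, J in SCORES.items()}
```

On this grid the three location scores show one sign change and the three scale scores show two, in agreement with the pattern noted in the text.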

Given the relative freedom in the choice of score function, one might ask if
it is possible to choose it optimally in some fashion. The answer is yes in the
following sense. For location alternatives, $F(x) = H(x)$ and $G(x) = H(x - \theta)$,
the optimal score function is

$$J_1(u) = -\frac{h'Q^H(u)}{hQ^H(u)},$$

where $h = H'$ is the density function of the distribution function $H$; $Q^H(u) =
\inf\{x : H(x) \ge u\}$ is the quantile function of $H$ and $hQ^H(u) = h[Q^H(u)]$ is a
composite function as is $h'Q^H(u)$. The density, $h$, and its derivative, $h'$, are
assumed to exist. The score function $J_1(u)$ is optimal in the sense that it maximizes
the asymptotic relative efficiency (ARE) of the test as defined by Noether
[see Randles and Wolfe (1979), pages 147 ff.]. The asymptotic relative efficiency
gives a measure of the power of one test relative to another. In fact, the test
that results from using $J_1$ is asymptotically relatively efficient. This means that
no test, not even a parametric one, will produce a better ARE. The optimality
of $J_1$ is quite strong. Similarly, for scale alternatives one finds that the optimal


score function is

$$J_2(u) = -1 - Q^H(u)\,\frac{h'Q^H(u)}{hQ^H(u)}.$$

Applying  these  formulas,  one  sees  that  the  Wilcoxon  is  optimal  at  detecting 
location  shifts  if  H  is  the  logistic  distribution  and  the  normal  scores  tests  are 
optimal  at  detecting  location  and  scale  shifts  in  underlying  normal  populations. 
In  one  sense  it  is  a  drawback  that  to  achieve  the  optimality  one  must  know 
the  underlying  family  of  distributions,  H ,  to  which  F  and  G  belong.  This  is 
not  the  case  if  one  is  merely  interested  in  protecting  best  against  certain  classes 
of  distributions.  For  example,  if  one  thought  that  the  underlying  distribution 
might  be  slightly  longer  tailed  than  the  normal,  the  Wilcoxon  test  is  a  good 
choice.  Even  though  it  is  optimal  only  for  H  logistic  (which  has  slightly  longer 
tails  than  the  normal),  one  should  expect  it  to  perform  well  against  the  broader 
family.  Further,  since  the  test  is  nonparametric  one  is  protected  in  case  F  and 
G  are  strongly  non-logistically  distributed. 
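As a worked check of the location formula, suppose (purely for illustration) that H is the standard logistic distribution with location 0 and scale 1. The optimal location score then reduces to an affine function of u:

```latex
\begin{align*}
H(x) &= \frac{1}{1+e^{-x}}, \qquad h(x) = H(x)\,[1-H(x)], \\
h'(x) &= h(x)\,[1-2H(x)], \\
-\frac{h'Q^H(u)}{hQ^H(u)} &= -\,[1-2HQ^H(u)] = 2u-1 .
\end{align*}
```

Since 2u - 1 is an increasing affine function of the Wilcoxon score u, the two scores lead to the same test, which is the sense in which the Wilcoxon is optimal for logistic location shifts.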

2.2.5.  Nonparametric  Tests  Against  General  Alternatives.  As  has  been 
shown  in  Subsections  2.2.2  and  2.2.3,  the  class  of  nonparametric  tests  against 
general  alternatives  most  nearly  matches  the  goal  of  minimal  assumptions  about 
F  and  G  outlined  in  Section  1.  In  this  subsection,  existing  nonparametric  tests 
against  general  alternatives  are  reviewed.  In  addition  to  standard  properties,  it 
should  be  examined  how  these  tests  fit  into  a  computer  oriented  data  exploratory 
environment.  It  is  seen  that  as  simple  statistics,  they  do  not  fit  well  into  such  an 
environment. Two tests are examined in detail: the Cramér-von Mises and the
Anderson-Darling. The Kolmogorov-Smirnov test is briefly mentioned, but not
examined in detail.

The Cramér-von Mises test has been defined in a number of ways, none quite
the same but all having the same spirit. Lehmann (1951) and Rosenblatt (1952)
define the Cramér-von Mises statistic as

$$\int [F_m(x) - G_n(x)]^2\,d[F_m(x) + G_n(x)];$$

Kiefer (1959) and Fisz (1960) employ the definition

$$\frac{mn}{N}\int [F_m(x) - G_n(x)]^2\,dH_N(x);$$

the Pyke and Shorack (1968) process leads to the definition

$$N\,\frac{\lambda_{(N)}}{1-\lambda_{(N)}}\int_0^1 [F_mQ_N(u) - u]^2\,du,$$

where $Q_N(u)$ is the sample quantile function of the pooled sample; and Parzen's
(1983) definition of the comparison distribution function leads to the definition

$$CVM_N = N\,\frac{\lambda_{(N)}}{1-\lambda_{(N)}}\int_0^1 [D_N(w) - w]^2\,dw, \tag{2.2.2}$$

where $D_N(w)$ is the sample distribution function of the normalized ranks,
$R_1/N, \ldots, R_m/N$. This last definition is the one which is used throughout this
work. All of these versions have the same motivation: one is measuring the
distance of F to G by an integral of a squared function. Here only (2.2.2) is
considered in detail.
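Because $D_N$ is a step function, the integral in (2.2.2) can be evaluated exactly piece by piece. The sketch below assumes that the input is the list of pooled-sample ranks of the first sample together with N, and that D_N equals j/m between the jth and (j+1)th normalized ranks; the function name is an illustrative assumption.

```python
def cvm_statistic(ranks, N):
    """CVM_N = N * lam / (1 - lam) * integral_0^1 [D_N(w) - w]^2 dw, lam = m / N.

    D_N is the step function that jumps by 1/m at each normalized rank R/N,
    so the integral is evaluated exactly on each constant piece.
    """
    m = len(ranks)
    lam = m / N
    knots = [0.0] + [r / N for r in sorted(ranks)] + [1.0]
    integral = 0.0
    for j in range(m + 1):
        a, b, level = knots[j], knots[j + 1], j / m
        # Exact integral of (level - w)^2 over [a, b].
        integral += ((b - level) ** 3 - (a - level) ** 3) / 3.0
    return N * lam / (1.0 - lam) * integral


separated = cvm_statistic([1, 2, 3], N=6)    # first sample holds the 3 smallest
interleaved = cvm_statistic([1, 3, 5], N=6)  # samples alternate in the pooling
```

With m = n = 3, the fully separated ranks give 1/3 while the interleaved ranks give 1/18, so the statistic grows as $D_N$ moves away from the uniform distribution function.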

The  limiting  distribution  of  (2.2.2)  is  the  same  as  that  of  the  corresponding 
one  sample  statistic  and  is  given  by  Anderson  and  Darling  (1952).  The  limiting 
distribution  depends  on  that  of  the  integrand  which  is  viewed  as  a  continuous 
parameter  stochastic  process.  To  achieve  a  limiting  process  for  the  integrand, 
one  must  make  certain  additional  assumptions  on  F  and  G.  These  are  detailed 
in  Subsection  2.3.3  and  won’t  be  discussed  here  further. 

Durbin and Knott (1972) give a very important alternate representation for
the Cramér-von Mises statistic in terms of what they call components. Although
they use the one sample problem as a format, their procedures apply to the two
sample problem as well. Using their techniques, one finds that

$$CVM_N = \sum_{j=1}^{\infty} \frac{Z_{Nj}^2}{j^2\pi^2}, \tag{2.2.3}$$

where the

$$Z_{Nj} = j\pi\sqrt{2}\int_0^1 \sin(j\pi w)\,\sqrt{N\,\frac{\lambda_{(N)}}{1-\lambda_{(N)}}}\,[D_N(w) - w]\,dw$$

are referred to as the components of the Cramér-von Mises statistic. The components,
$Z_{N1}, Z_{N2}, \ldots$, are asymptotically independent with the standard normal
distribution under H0.

Durbin and Knott's (1972) result is derived by an orthonormal expansion
of the random function $\sqrt{N\lambda_{(N)}/(1-\lambda_{(N)})}\,[D_N(w) - w]$. The techniques used
are basic to Fourier analysis. The equality (2.2.3) follows easily from Parseval's
identity. The distributional results are somewhat deeper in nature and a good
discussion is given in Shorack and Wellner (1986), pages 215 ff.

Anderson and Darling (1952) suggest an alternative statistic, $AD_N$, which
can be written as

$$AD_N = N\,\frac{\lambda_{(N)}}{1-\lambda_{(N)}}\int_0^1 \frac{[D_N(w) - w]^2}{w(1-w)}\,dw.$$

The rationale for the extra term, $w(1-w)$, is to give each point of the process
$\sqrt{N\lambda_{(N)}/(1-\lambda_{(N)})}\,[D_N(w) - w]$ equal weight in a statistical sense. This follows
from the fact that under H0 its limiting process, call it $CD_N(w)$, has variance
$w(1-w)$. Anderson and Darling (1952) determine the distribution of $AD_N$
under the null hypothesis. Durbin and Knott show that this statistic, too, has a
representation in terms of components. This representation is


$$AD_N = \sum_{j=1}^{\infty} \frac{[\,j\text{th component}\,]^2}{j(j+1)},$$

where the components are defined in terms of the functions

$$LP_j(w) = 2\sqrt{(2j+1)\,w(1-w)}\;L_j'(2w-1),$$

and $L_j$ is the jth Legendre polynomial. Again, under H0, the components are
asymptotically distributed as independent standard normal random variables.

The asymptotic representations for $CVM_N$ and $AD_N$ allow important
interpretations of the manner in which these statistics operate. It is well known
that these tests are consistent against any alternative [see Randles and Wolfe
(1979), page 384], yet one must certainly be sensitive to the issue of power as
well. Consider the first component, $Z_{N1}$, of the Cramér-von Mises statistic. It
can be written as

$$\begin{aligned}
Z_{N1} &= a_N\int_0^1 \sin(\pi w)\,[D_N(w) - w]\,dw \\
&= b_N\int_0^1 \cos(\pi w)\,d[D_N(w) - w] \\
&= b_N\sum_{j=1}^{m}\cos(\pi R_j/N),
\end{aligned}$$

where $a_N$ and $b_N$ are constants depending only on N. This component has the
form of a linear rank statistic with score function $J_1(u) = \cos(\pi u)$. Since $J_1(u)$
has but one zero crossing in (0, 1), the first component is a test against a location
shift. It can be shown that $J_1(u)$ is the optimal score function for detecting shifts
in the Cauchy family. In the same manner, the first component of the Anderson-Darling
statistic can be shown to be the Wilcoxon statistic. Similarly, the second
component of $CVM_N$ is a rank statistic with score function $J_2(u) = \cos(2\pi u)$,
which is a score for scale alternatives. The process continues with score functions
of successively higher frequency.
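The rank-statistic form of the components can be made concrete with a short sketch. Integrating the defining integral of the jth component by parts and substituting $\lambda_{(N)} = m/N$ gives, under the assumption that $D_N$ jumps by 1/m at each $R_i/N$, the closed form $\sqrt{2N/(mn)}\sum_i \cos(j\pi R_i/N)$; this closed form and the function names below are worked out here for illustration and are not taken from the source.

```python
import math


def cvm_component(ranks, N, j):
    """j-th Cramer-von Mises component written as a linear rank statistic.

    Integration by parts of the defining integral gives
    Z_Nj = sqrt(2 N / (m n)) * sum_i cos(j * pi * R_i / N).
    """
    m = len(ranks)
    n = N - m
    return math.sqrt(2.0 * N / (m * n)) * sum(
        math.cos(j * math.pi * r / N) for r in ranks)


def cvm_from_components(ranks, N, terms=2000):
    """Partial sum of Z_Nj^2 / (j pi)^2, which approaches CVM_N as terms grows."""
    return sum(cvm_component(ranks, N, j) ** 2 / (j * math.pi) ** 2
               for j in range(1, terms + 1))


# Toy case m = n = 1 with the first-sample observation ranked lowest;
# direct integration of (2.2.2) gives CVM_N = 1/6 for this configuration.
partial = cvm_from_components([1], N=2)
```

The partial sum agrees with 1/6 to about three decimal places after 2000 terms, illustrating Parseval's identity, and the sign of the first component indicates the direction of the shift: it is positive when the first sample occupies the low ranks and negative when it occupies the high ranks.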

The interpretation of the components as rank statistics is very important.
Both $CVM_N$ and $AD_N$ successively and rapidly downweight these rank statistics
in calculating an overall portmanteau statistic. Although they may be consistent
against general alternatives, one expects that they would have poor power
characteristics against any but the first few components. This is indeed the case, as
is seen in Section 4.

There does exist a third statistic, the Kolmogorov-Smirnov statistic, which
is consistent against general alternatives. It is defined as

$$KS_N = \sqrt{N\,\frac{\lambda_{(N)}}{1-\lambda_{(N)}}}\;\sup_{0<w<1} |D_N(w) - w|,$$

which measures the deviations from uniformity in a supremum norm sense. It
tests H0 only by the largest deviation of $D_N(w)$ from $w$. The Kolmogorov-Smirnov
test does not have a representation in terms of components, which makes
its response to various alternatives much more difficult to gauge.
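Since $D_N$ is a step function, the supremum in the Kolmogorov-Smirnov statistic is attained at a jump point, approached from one side or the other. A minimal sketch under that observation follows; the function name and the toy ranks are illustrative assumptions.

```python
import math


def ks_statistic(ranks, N):
    """KS_N = sqrt(N * lam / (1 - lam)) * sup_w |D_N(w) - w|, lam = m / N.

    D_N is a step function, so the supremum is attained at a jump point,
    approached either from the left or from the right.
    """
    m = len(ranks)
    lam = m / N
    sup = 0.0
    for j, r in enumerate(sorted(ranks), start=1):
        w = r / N
        # Value just after the jump (j/m) and just before it ((j-1)/m).
        sup = max(sup, abs(j / m - w), abs((j - 1) / m - w))
    return math.sqrt(N * lam / (1.0 - lam)) * sup


ks_separated = ks_statistic([1, 2, 3], N=6)
```

Only the single largest deviation enters the result, which is why the statistic carries no information about where else, or in what pattern, $D_N$ departs from uniformity.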

As  a  final  point  concerning  nonparametric  tests  against  general  alternatives, 
consider  how  these  statistics  relate  to  the  criteria  outlined  in  Section  1.  As 
nonparametric  tests  against  general  alternatives,  they  certainly  make  minimal 
assumptions  on  F  and  G.  There  are  certain  troubling  questions  about  their 
power.  It  is  also  difficult  to  see  how  they  fit  into  a  graphical,  exploratory  mode 
of  data  analysis.  As  simple  statistical  tests,  they  simply  accept  or  reject.  There 
is  no  explanation  as  to  why  H0  is  rejected  should  it  be. 

2.2.6.  Criteria  for  a  Methodology.  Having  reviewed  existing  two  sample 
techniques  and  armed  with  an  outline  of  goals  from  Section  1,  a  list  of  criteria 
for  a  methodology  can  now  be  given.  The  list  gives  the  desirable  properties  a 
procedure  should  possess  in  order  to  attain  the  goals  given  in  Section  1.  These 
criteria  result  directly  from  a  combination  of  these  goals  and  observations  made 
concerning  existing  techniques  in  this  section. 

The criteria for a two sample procedure which must be met by any procedure
derived in this work are:

1.  It is not solely number oriented but does possess graphical features.

2.  It is not only a statistical test but also a selection procedure for a model of
the relation of F to G.

3.  It should be omnibus.

4.  It should be nonparametric distribution free under the null hypothesis.

A procedure with strong graphical elements is desired to take advantage of
modern computing environments. Statistics are needed, to be sure. However,
numbers by themselves cannot convey the quantity or diversity of information
of which graphs are capable. Graphs have shape; numbers do not. Statistics are
useful as diagnostics and indicators; however, there is no longer any need to rely
upon them exclusively.

Dual to statistical testing is model selection. Suppose a model, $M$, for a
process is to be chosen from the class of possible models, $\mathcal{M}$. The model gives,
in some fashion, the true behavior of the process. In the two sample case, it
would give the true relation of F to G. Suppose further that the null hypothesis
corresponds to some subset, $A$, of $\mathcal{M}$. Choosing a model is dual to testing the
null hypothesis in that the null hypothesis is rejected if and only if the chosen
model is not in $A$. Similarly, a test of H0 can be viewed as a model selection
process if by rejecting H0 an element of $\mathcal{M}$ not in $A$ is selected. Any procedure
derived must explicitly represent this duality.

As  stated  in  Section  1,  it  is  desired  to  make  minimal  assumptions  on  F 
and G either under H0 or Ha. In statistical terminology, it has been
seen  in  this  section  that  this  desire  translates  into  a  nonparametric  test  against 
general  alternatives.  The  wish  to  have  a  test  consistent  against  any  alternative 
is  tempered  by  the  desire  for  a  test  which  has  good  power  characteristics  against 
a  wide  range  of  alternatives.  It  will  not  be  required  that  a  test  be  consistent 
against  any  alternative,  but  that  it  be  consistent  for  a  wide  range  of  alternatives 
or  be  omnibus.  The  procedure  should  also  be  nonparametric  distribution  free 
under  H0  so  that  the  distributional  problems  possess  a  solution. 

2.3.  Review  of  the  Comparison  Density 

2.3.1.  Introduction.  There  is  an  object  which  lends  itself  to  the  sort  of  graph¬ 
ical,  functional  type  of  portmanteau  test  which  was  outlined  in  Subsection  2.2.6. 
This  object  is  termed  the  comparison  density  by  Parzen  (1983).  The  estimation 
of  and  tests  based  on  the  comparison  density  will  form  the  foundation  of  this 
dissertation.  In  the  following  subsections,  the  comparison  density  is  defined,  the 
properties  of  related  stochastic  processes  are  discussed,  and  existing  tests  based 
on it are reviewed.

2.3.2.  Definition of the Comparison Density. In this subsection, the comparison
density is defined and its properties reviewed. Let $a_f$ and $a_g$ be the lower
endpoints of $f$ and $g$, respectively, so that $a_f, a_g \ge -\infty$. Similarly, let $b_f$ and $b_g$
be the upper endpoints of $f$ and $g$, respectively, so that $b_f, b_g \le \infty$. From the
outset, assume that the distribution functions, $F$ and $G$, the quantile functions,
$Q^F$ and $Q^G$, and the densities, $f$ and $g$, satisfy

(2.3.1)  a.  $Q^F$ and $Q^G$ are continuous;
         b.  $F$ and $G$ are absolutely continuous with densities $f$ and $g$;
         c.  $f$ and $g$ are both continuous on the interval $[a_f \wedge a_g,\ b_f \vee b_g)$.

Parzen (1983) defines the comparison distribution function to be

$$D_\lambda(w) = FQ_\lambda(w) = F \circ Q_\lambda(w) \quad \text{for } 0 \le w \le 1,$$

where $Q_\lambda(w)$ is the quantile function of $H_\lambda(x) = \lambda F(x) + (1-\lambda)G(x)$ and
$\circ$ means function composition. A few of the simple properties of $D_\lambda$ are: (a)
$D_\lambda(0) = 0$; (b) $D_\lambda(1) = 1$; (c) $D_\lambda$ is increasing on [0, 1]; and (d) $D_\lambda$ is absolutely
continuous on [0, 1]. These properties justify the term 'distribution' as $D_\lambda$ is, in
fact, a distribution function.

The comparison density is just the derivative of the comparison distribution
function

$$d_\lambda(w) = \frac{d}{dw}D_\lambda(w) = \frac{fQ_\lambda(w)}{h_\lambda Q_\lambda(w)}, \tag{2.3.2}$$

since $Q_\lambda'(w) = 1/h_\lambda Q_\lambda(w)$ [see Parzen (1979)]. Note that condition (2.3.1c)
ensures that $d_\lambda(w)$ is continuous on (0, 1). The continuity of $d_\lambda(w)$ is needed to
show the weak convergence of the comparison distribution process (see Subsection
2.3.3). The condition (2.3.1c) allows for many possible F and G, but some
choices are excluded. For example, taking F as the N(0,1) distribution and
G as the standard lognormal does satisfy this condition since $g$ is continuous
on $\mathbb{R}$. Taking F as the N(0,1) distribution and G as the standard exponential
distribution does not satisfy condition (2.3.1c) since $g$ is discontinuous at 0. The
comparison density will also be discontinuous. Parzen (1983) gives several of the
elementary properties of $d_\lambda$ as: (a) $0 \le d_\lambda(w) \le 1/\lambda$; (b) $d_\lambda(w) \to 0$ if $f \to 0$;
and (c) $d_\lambda(w) \to 1/\lambda$ if $g \to 0$.

The most important interpretation of $d_\lambda$ is that of a likelihood ratio. The
comparison density is the likelihood ratio of the density of the first sample to the
density of the pooled sample, $f$ to $h_\lambda$, evaluated at the quantile function of the
pooled sample, $Q_\lambda$. Now if $F = G$, $h_\lambda(x) = f(x)$ for all $x$. If $F \ne G$, then $f$ will
differ from $g$ on at least an interval (since both are continuous). Consequently,
$F = G$ if and only if $d_\lambda(w) = 1$ for $0 < w < 1$. Thus the hypothesis H0: F = G
is equivalent to

$$H_0: d_\lambda(w) = 1 \quad \text{for } 0 < w < 1.$$

Furthermore, if the alternative $F \ne G$ holds then $d_\lambda$ specifies the way in which the
hypothesis fails. This specification is given by departures of $d_\lambda$ from uniformity.
It is also possible to specify these departures in terms of the usual likelihood
ratio of $f$ to $g$ by noting, as Parzen (1983) does, that

$$\frac{1}{d_\lambda(w)} = \lambda + (1-\lambda)\,\frac{gQ_\lambda(w)}{fQ_\lambda(w)}.$$


However, there is really no need to go to this trouble unless an estimate of $f/g$ is
specifically desired. Visually, it is enough to know that $d_\lambda(w) > 1$ if and only if
$fQ_\lambda(w) > gQ_\lambda(w)$. A further argument for interest in $d_\lambda$ instead of $f/g$ is that
$d_\lambda$ is bounded between 0 and $1/\lambda$ whereas $f/g$ will often be unbounded. The
estimation of unbounded functions is a significantly more difficult task which is
best avoided, if possible.
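To make the likelihood ratio interpretation concrete, the comparison density can be evaluated numerically when F and G are known. The sketch below is an illustration only: it assumes normal F and G (represented by `statistics.NormalDist` objects), inverts the pooled distribution function by bisection on an assumed bracketing interval, and uses a hypothetical function name.

```python
from statistics import NormalDist


def comparison_density(F, G, lam, w, lo=-40.0, hi=40.0):
    """d_lam(w) = f(Q(w)) / h(Q(w)), where h = lam*f + (1-lam)*g and Q inverts H.

    F and G are statistics.NormalDist objects; Q(w) is found by bisection on
    H(x) = lam*F(x) + (1-lam)*G(x), assumed to cross w inside [lo, hi].
    """
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if lam * F.cdf(mid) + (1.0 - lam) * G.cdf(mid) < w:
            lo = mid
        else:
            hi = mid
    x = 0.5 * (lo + hi)
    f, g = F.pdf(x), G.pdf(x)
    return f / (lam * f + (1.0 - lam) * g)


# Location alternative: F = N(0, 1), G = N(1, 1), lam = 1/2.
F, G, lam = NormalDist(0, 1), NormalDist(1, 1), 0.5
grid = [(k + 0.5) / 400 for k in range(400)]
d = [comparison_density(F, G, lam, w) for w in grid]
```

For this location alternative the computed values stay strictly between 0 and 1/λ = 2, decrease monotonically in w, and average to approximately 1 over the grid, as a density on (0, 1) should; taking G equal to F returns the uniform density.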

Given a plot of $d_\lambda$, one might wonder if one can determine the kinds of F
and G that generated it. Figure 1 presents $d_\lambda$ for a variety of F and G. Figures
(a), (c), (d) and (f) are location alternatives, that is, $G(x) = F(x - \theta)$, for some
constant $\theta$. Figure (b) is a scale alternative, so $G(x) = F(x/\theta)$. Although one
might be tempted to classify Figure (e) as a scale alternative based on the fact
that $G(x) = F(x/2)$, it is best characterized as a location alternative. This is
so since it is easily converted to a location alternative by taking logarithms of
the random variables. This interpretation is doubly pleasing since Figure (e)
appears similar to Figure (a). One could define a pure location alternative as
those alternatives for which $d_\lambda$ is monotone and a pure scale alternative as those
for which $d_\lambda$ has one sign change of its derivative. However, Figures (c) and (f)
fail such a criterion. It would seem preferable to classify as a location alternative
those $d_\lambda$ for which the dominant term in an orthonormal expansion of $d_\lambda$ is the
lowest frequency. Scale alternatives would have as a dominant term the next
higher frequency. Figures (b) and (g) are quite similar; yet the first is a scale
alternative and the second a general alternative. One could argue that in the case
of Figure (g) the dominant difference between the two is scale. In Section
3, diagnostics will be introduced which help indicate the types of relations of F
to G present.

Fig. 1.  The comparison density $d_\lambda$ for a variety of F and G. Figure (a) is
constructed using F = N(0, 1) and G = N(1, 1); Figure (b), F = N(0, 1) and G = N(0, 4);
Figure (c), F = Cauchy(0, 1) and G = Cauchy(1, 1); Figure (d), F = Triangular(0, 1)
and G = Triangular(1, 1); Figure (e), F = Exp(1) and G = Exp(1/2); Figure
(f), F = Lognormal(0, 1) and G = Lognormal(1, 1); Figure (g), F = Weibull(3)
and G = Exp(1); and Figure (h), F = N(0, 1) and G = N(1, 1). Figures (a)
through (g) are constructed with $\lambda = 1/2$ and Figure (h) is constructed using
$\lambda = 1/4$.

The  comparison  density  is  a  specific  example  of  the  general  technique  of 
reducing  the  null  hypothesis  to  a  test  of  uniformity  of  a  function  defined  on  [0, 1]. 
The  idea  is  well  established  in  terms  of  the  one  sample  problem  in  which  the  null 
hypothesis  is  completely  specified.  In  the  one  sample  location-scale  problem, 
Parzen  (1979)  introduces  the  more  general  approach  taken  here.  He  terms  this 
approach  the  density  estimation  approach  to  goodness  of  fit.  In  his  comments  on 
the  article,  Lindley  remarks  that  this  approach  “provides  something  which  looks 
as  if  it  will  be  easier  to  handle  than  the  raw  functions.” 

The  comparison  density  is  seen  as  a  starting  point  to  fulfilling  the  criteria  for 
testing  outlined  in  Subsection  2.2.6.  The  comparison  density,  being  a  function, 
is graphical in nature and conveys information as to the relation of $f$ to $g$. The
goal  is  to  estimate  the  comparison  density,  in  a  manner  to  be  determined,  and 
to  test  that  estimate  for  uniformity.  If  the  test  rejects,  a  graph  of  the  estimate 
is  displayed  and  various  diagnostics  presented. 

2.3.3.  The  Comparison  Distribution  Empirical  Process.  It  has  been  seen 
that the comparison density, $d_\lambda$, is a useful and interesting object. It is desired
both  to  estimate  the  comparison  density  and  to  derive  inferential  procedures 
concerning  its  uniformity.  As  pointed  out  in  Subsection  2.2.6,  however,  these 
two  goals  are  dual  in  nature,  each  depending  on  the  other. 

As a practical matter, one needs to determine a stochastic process such
that an estimator of $d_\lambda$ can be written as a functional of this process. In this
subsection, such a stochastic process is discussed. This stochastic process is
termed the comparison distribution empirical process and is introduced by Parzen
(1983) as a unifying concept. It will be seen to be a unification in that linear rank
statistics, the Cramér-von Mises, Anderson-Darling and Kolmogorov-Smirnov
statistics can all be conveniently represented as functionals of this process.

As a means of motivating this approach, consider estimating the mean, $\mu$,
of the distribution, $F$, from the first sample, $X_1, \ldots, X_m$. Suppose that $F$ has
finite variance $\sigma^2$. The sample mean, $\bar{X}$, is given by $\bar{X} = \int x\,dF_m(x)$, where $F_m$
is the empirical distribution function. One is led naturally to the statistic

$$\begin{aligned}
\sqrt{m}\,(\bar{X} - \mu) &= \sqrt{m}\int x\,d[F_m(x) - F(x)] \\
&= \sqrt{m}\int_0^1 Q^F(t)\,d[F_mQ^F(t) - t] \\
&= \int_0^1 Q^F(t)\,dU_m(t) \\
&= -\int_0^1 q^F(t)\,U_m(t)\,dt,
\end{aligned}$$

where $q^F(t) = \frac{d}{dt}Q^F(t)$ and $U_m(t) = \sqrt{m}\,[F_mQ^F(t) - t]$. The quantity $U_m(t)$ is
termed the uniform empirical process [see Shorack and Wellner (1986), page 86] and it is
well known [Shorack and Wellner, page 110] that $U_m(t) \Rightarrow U(t)$ as $m \to \infty$, where
$\Rightarrow$ denotes weak convergence. Here, $U(t)$ is a Brownian bridge; it can be characterized
as $U(t) = W(t) - tW(1)$, where $W(t)$ is a Wiener process. The Wiener process is defined as
a process that satisfies $W(0) = 0$ a.s., $W(t) \sim N(0, t)$, and $W(t)$ has stationary,
independent increments. Hence, the Brownian bridge is a Gaussian process with
mean function $E[U(t)] = 0$, covariance kernel $K(s,t) = E[U(t)U(s)] = \min(s,t) - st$,
and $U(0) = U(1) = 0$ a.s. See Billingsley (1986), page 522, or Shorack and
Wellner (1986) for more details.
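The covariance kernel of the Brownian bridge can be checked by simulation. The sketch below constructs U from a Wiener process via U(t) = W(t) - tW(1), assuming Wiener increments with variance equal to the elapsed time; the function name, the seed, the time points s = 0.3 and t = 0.7, and the number of replications are all arbitrary illustrative choices.

```python
import math
import random


def bridge_pair(s, t, rng):
    """Draw (U(s), U(t)) for s < t using U(t) = W(t) - t * W(1).

    The Wiener increments over [0, s], [s, t] and [t, 1] are independent
    normals with variances s, t - s and 1 - t.
    """
    w_s = rng.gauss(0.0, math.sqrt(s))
    w_t = w_s + rng.gauss(0.0, math.sqrt(t - s))
    w_1 = w_t + rng.gauss(0.0, math.sqrt(1.0 - t))
    return w_s - s * w_1, w_t - t * w_1


rng = random.Random(0)  # fixed seed so the sketch is reproducible
s, t, reps = 0.3, 0.7, 40000
draws = [bridge_pair(s, t, rng) for _ in range(reps)]
estimate = sum(u * v for u, v in draws) / reps  # Monte Carlo E[U(s) U(t)]
target = min(s, t) - s * t                      # kernel value, here 0.09
```

With this many replications the Monte Carlo standard error is roughly 0.001, so the estimate should fall close to the kernel value min(s, t) - st.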

Given the weak convergence of $U_m$ to $U$, one would hope that it would follow
that

$$\int_0^1 q^F(t)\,U_m(t)\,dt \xrightarrow{d} \int_0^1 q^F(t)\,U(t)\,dt \quad \text{as } m \to \infty.$$

This last integral is normally distributed with mean

$$E\Big[\int_0^1 q^F(t)\,U(t)\,dt\Big] = \int_0^1 q^F(t)\,E[U(t)]\,dt = 0,$$

and variance

$$\mathrm{Var}\Big[\int_0^1 q^F(t)\,U(t)\,dt\Big] = \int_0^1 Q^F(t)^2\,dt - \Big[\int_0^1 Q^F(t)\,dt\Big]^2 = \int x^2 f(x)\,dx - \Big[\int x f(x)\,dx\Big]^2 = \sigma^2$$

[see Parzen (1962b), page 77, and Parzen (1979)]. It has just been proved, albeit
somewhat heuristically, that $\bar{X}$ is $AN(\mu, \sigma^2/m)$. The key is to define an empirical
process such that the parameter one is interested in can be written as a functional
of that process. For this example, one need only apply the central limit theorem
to achieve the result. The case of estimating $d_\lambda$ is far less straightforward in that
one is not estimating a single parameter from a random sample. The stochastic
process approach is the only one available.

The parameter $\lambda$ represents the weight or proportion given the distribution
$F$. For the samples $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$, the natural estimate of $\lambda$ is
$\lambda_{(N)} = m/N$, where $N = m + n$. It is assumed throughout that $\lambda_{(N)} \to \lambda_0$
as $m \wedge n \to \infty$, where $0 < \lambda_0 < 1$. Define $D_{(N)}(w) = D_{\lambda_{(N)}}(w)$, $d_{(N)}(w) =
d_{\lambda_{(N)}}(w)$, $D_0(w) = D_{\lambda_0}(w)$, and $d_0(w) = d_{\lambda_0}(w)$. These first two functions
depend on N, but only through the parameter $\lambda_{(N)}$, and so are not random. In
fact, the goal is the estimation of $d_{(N)}$, which is tantamount to estimating $d_0$ as
$m \wedge n \to \infty$.

The obvious process to define for the comparison distribution function is

$$L_N(w) = \sqrt{N}\,[K_N(w) - D_{(N)}(w)], \tag{2.3.3}$$

where $K_N(w) = F_mQ_N(w)$, and $Q_N$ is the sample quantile function of the pooled
sample. The process (2.3.3) simply substitutes the empirical functions for the
unknowns, which is the manner in which $U_m$, above, was constructed. Let $X_{(i)}$
be the ith order statistic of $X_1, \ldots, X_m$, for $i = 1, \ldots, m$. Let $R_i$ be the rank of
$X_i$ in the pooled sample and let $R_{(i)}$ be the rank of $X_{(i)}$ in the pooled sample.
This latter notation is appropriate since $R_{(i)}$ is also the ith order statistic of
$R_1, \ldots, R_m$. Now $K_N$ is given by

$$K_N(w) = \begin{cases}
0, & \text{for } 0 \le w \le (R_{(1)} - 1)/N; \\
j/m, & \text{for } (R_{(j)} - 1)/N < w \le (R_{(j+1)} - 1)/N \text{ and } j = 1, \ldots, m-1; \\
1, & \text{for } (R_{(m)} - 1)/N < w \le 1.
\end{cases}$$

Figure 2(a) gives the typical appearance of $K_N$.

Fig. 2.  The typical appearance of $K_N$ and $D_N$. Figure (a) is $K_N$ and (b) is
$D_N$. The two graphs are constructed using the same input ranks and m = 10
and n = 10. The block on the end of the line segments marks the value of the
function at the jump points.

Pyke and Shorack (1968) study the process $L_N(w)$ extensively. Their main
result is that under conditions (2.3.1), $L_N(w) \Rightarrow L(w)$ as $m \wedge n \to \infty$, where

$$L(w) = (1-\lambda_0)\Big\{\bar{d}_0(w)\,U[D_0(w)]/\sqrt{\lambda_0} - d_0(w)\,V[\bar{D}_0(w)]/\sqrt{1-\lambda_0}\Big\}, \tag{2.3.4}$$

and $U$, $V$ are independent Brownian bridges and $\bar{D}_0$ and $\bar{d}_0$ satisfy $\lambda_0 D_0(w) +
(1-\lambda_0)\bar{D}_0(w) = w$ and $\lambda_0 d_0 + (1-\lambda_0)\bar{d}_0 = 1$. The process $L(w)$ is Gaussian
with mean function 0 and covariance kernel $K(u,v) = E[L(u)L(v)]$, equal to

$$K(u,v) = (1-\lambda_0)^2\Big\{\bar{d}_0(u)\,\bar{d}_0(v)\,D_0(u)[1-D_0(v)]/\lambda_0 + d_0(u)\,d_0(v)\,\bar{D}_0(u)[1-\bar{D}_0(v)]/(1-\lambda_0)\Big\}, \tag{2.3.5}$$

for $u \le v$. If $F = G$, $K(u,v)$ and $L(w)$ simplify tremendously and one finds that
$L(w) = (1-\lambda_0)[U(w)/\sqrt{\lambda_0} - V(w)/\sqrt{1-\lambda_0}]$ and $K(u,v) = (1-\lambda_0)\,u(1-v)/\lambda_0$
for $u \le v$.

Parzen (1983) chooses to define a slightly different process, $CD_N(w)$, as

$$CD_N(w) = \sqrt{N}\,[D_N(w) - D_{(N)}(w)],$$

where $D_N(w) = [H_N Q_m]^{-1}(w)$; $H_N$ is the empirical distribution function of the
pooled sample; $Q_m$ is the empirical quantile function of the first sample; and
the exponent $-1$ refers to a special type of inverse which is given below. The
function $H_N Q_m(u)$ has values

$$H_N Q_m(u) = \begin{cases}
0, & \text{for } u = 0; \\
R_{(j)}/N, & \text{for } (j - 1)/m < u \le j/m \text{ and } j = 1, \dots, m;
\end{cases}$$

since $Q_m(u) = X_{(j)}$ for $(j - 1)/m < u \le j/m$. Notice that $H_N Q_m$ is defined on 0 to
1, is non-decreasing and left continuous. These are the characteristic properties
of a quantile function. Its inverse is defined as $D_N(w) = \sup\{u : H_N Q_m(u) \le w\}$
and results in

$$D_N(w) = \begin{cases}
0, & \text{for } 0 \le w < R_{(1)}/N; \\
j/m, & \text{for } R_{(j)}/N \le w < R_{(j+1)}/N \text{ and } j = 1, \dots, m - 1; \\
1, & \text{for } R_{(m)}/N \le w \le 1.
\end{cases}$$

As a matter of notation, $D_N$ is a stochastic process; it is estimating $D_{(N)}$, which is
not random. If there are parentheses around the $N$ in a subscript, that quantity
is not random. If there are no parentheses, the quantity is random. Note that
$D_N(w)$ is non-decreasing, right continuous, and $D_N(0) = 0$ and $D_N(1) = 1$.
These are the characteristic properties of a distribution function on [0, 1]. Figure
2(b) presents the function $D_N$ for the same data as is used to construct $K_N$ in
Figure 2(a). Aly, Csörgő, and Horváth (1987) use embedding techniques to prove
that $CD_N(w)$ converges weakly to the process $L(w)$. Parzen (1983) conjectures
that this result can be obtained from Pyke and Shorack's (1968) results since
$D_N(w)$ and $K_N(w)$ differ by $1/m$ on $m$ intervals of length $1/N$. A proof is as
follows. The function $D_N(w)$ can be written as $D_N(w) = K_N(w) + \Delta_N(w)$,
where

$$\Delta_N(w) = \begin{cases}
-1/m, & \text{for } (R_{(j)} - 1)/N < w < R_{(j)}/N \text{ and } j = 1, \dots, m; \\
0, & \text{otherwise.}
\end{cases}$$
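The relation between the two step functions can be checked numerically. The
following is a minimal sketch, assuming no ties, taking $K_N = F_m Q_N$ with $Q_N$
the left-continuous pooled sample quantile function, and taking $D_N(w)$ as the
proportion of first-sample ranks with $R_i/N \le w$. The function name
`comparison_functions` is hypothetical and is not part of the original work.

```python
import numpy as np

def comparison_functions(x, y):
    """Return (K_N, D_N, m, N) for first sample x and second sample y (no ties assumed)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    m, N = len(x), len(x) + len(y)
    pooled = np.sort(np.concatenate([x, y]))
    # R_(1) < ... < R_(m): ordered ranks of the first sample in the pooled sample
    ranks = np.searchsorted(pooled, np.sort(x)) + 1

    def K_N(w):
        # F_m(Q_N(w)), where Q_N(w) is the ceil(N w)-th pooled order statistic
        k = np.ceil(N * np.asarray(w, dtype=float) - 1e-12)
        return np.searchsorted(ranks, k, side="right") / m

    def D_N(w):
        # proportion of first-sample ranks with R_i / N <= w
        k = np.floor(N * np.asarray(w, dtype=float) + 1e-12)
        return np.searchsorted(ranks, k, side="right") / m

    return K_N, D_N, m, N
```

Evaluating both functions on a fine grid shows that they agree except on short
intervals just to the left of each $R_{(j)}/N$, where they differ by $1/m$ in
absolute value.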

Pyke and Shorack (1968) give the following representation for $L_N(w)$:

(2.3.6)  $L_N(w) = (1 - \lambda_{(N)})\left( B_N(w)\,U_m[FQ_N(w)]/\sqrt{\lambda_{(N)}}
          - A_N(w)\,V_n[GQ_N(w)]/\sqrt{1 - \lambda_{(N)}} \right) + \delta_N(w)$,

where $A_N(w) = [D_{(N)}(u_w) - D_{(N)}(w)]/(u_w - w)$ and $u_w = HQ_N(w)$; $B_N(w)$
and $\delta_N$ are defined by $\lambda_{(N)}A_N(w) + (1 - \lambda_{(N)})B_N(w) = 1$ and $\delta_N(w) =
A_N(w)\sqrt{N}(H_N Q_N(w) - w)$; $U_m$ and $V_n$ are the uniform empirical processes
for the first and second sample, respectively. Pyke and Shorack use the
Skorohod device [see Shorack and Wellner (1986), page 54] to create versions, $\tilde U_m$,
$\tilde V_n$, $\tilde U$, and $\tilde V$ of $U_m$, $V_n$, $U$, and $V$ such that the new versions are distributed
identically as the original versions, are defined on a common probability space,
and satisfy

$$\|\tilde U_m - \tilde U\| \to 0 \text{ a.s.},$$

$$\|\tilde V_n - \tilde V\| \to 0 \text{ a.s.},$$

as $m \wedge n \to \infty$, where $\|\cdot\|$ is the sup-norm. If one can show convergence
in probability for the new processes, this would imply that their probability
measures also converge. Since these probability measures are identical to those
of the original processes, this must mean that they converge also and hence that
one has weak convergence. These new versions are substituted into equations
(2.3.4) and (2.3.6) to obtain $\tilde L(w)$ and $\tilde L_N(w)$, respectively. They then show
that

$$\|\tilde L_N(w) - \tilde L(w)\| \to_p 0 \text{ as } m \wedge n \to \infty.$$

Write $\widetilde{CD}_N(w) = \tilde L_N(w) + \sqrt{N}\,\Delta_N(w)$ so that

$$\|\widetilde{CD}_N(w) - \tilde L(w)\| = \|\tilde L_N(w) + \sqrt{N}\,\Delta_N(w) - \tilde L(w)\|
\le \|\tilde L_N(w) - \tilde L(w)\| + \sqrt{N}\,\|\Delta_N(w)\|
\to_p 0 \text{ as } m \wedge n \to \infty,$$

since $\sqrt{N}\,\|\Delta_N\| = \sqrt{N}/m \to 0$.

Hence, the same proof works with little additional work. It is important to
establish this fact as weak convergence under local alternatives will be shown for
$L_N(w)$ in Section 4 and the result carried over to $D_N(w)$.

The stochastic process suggested by Parzen will be the basic process used
in this work. It is chosen over that of Pyke and Shorack for several reasons.
First, $D_N(w)$ has the form of the sample distribution function constructed from
the data $R_1/N, \dots, R_m/N$. This sample distribution function is estimating the
true distribution function whose density it is desired to estimate. The analogy
with the ordinary density estimation is very strong as shall be seen in Subsection
2.4. In this case, one views $R_1/N, \dots, R_m/N$ as data arising from the density
$d_{(N)}$ and uses conventional density estimators for $d_{(N)}$. As with the example of
the sample mean, these estimators can be written as functionals of a stochastic
process. The limiting distribution of these functionals differs from the usual case
because the limiting distribution of the underlying stochastic process is different.
Second, $D_N(w)$ is preferred because rank statistics are more easily represented
as functionals of $D_N(w)$. Recall the rank statistic $S_N$, formed from the scores
$J(R_i/N)$, $i = 1, \dots, m$, as defined by (2.2.1). This statistic can be neatly
rewritten as

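$$S_N = \int_0^1 J(w)\,dD_N(w).$$

In fact, one can rewrite the centered form of the statistic as

$$S_N^* = \int_0^1 J(w)\,dCD_N(w) = -\int_0^1 J'(w)\,CD_N(w)\,dw,$$

where the second equality follows on integrating by parts, since
$CD_N(0) = CD_N(1) = 0$.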
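Because $D_N$ places mass $1/m$ on each point $R_i/N$, an integral with respect
to $D_N$ is simply an average of scores. The sketch below illustrates this and
checks it against an integration by parts; the function names are hypothetical,
and the normalization of $S_N$ in (2.2.1), which is not reproduced here, is
assumed to match the integral form.

```python
import numpy as np

def integral_J_dDN(J, ranks, N):
    # D_N places mass 1/m on each point R_i / N, so the integral is an average of scores.
    return float(np.mean(J(np.asarray(ranks, dtype=float) / N)))

def integral_by_parts(J, J_prime, ranks, N, grid=200_000):
    # J(1) D_N(1) - J(0) D_N(0) - int_0^1 J'(w) D_N(w) dw, using a midpoint rule.
    w = (np.arange(grid) + 0.5) / grid
    r = np.sort(np.asarray(ranks, dtype=float))
    D_N = np.searchsorted(r, np.floor(N * w + 1e-12), side="right") / len(r)
    return float(J(1.0) - np.mean(J_prime(w) * D_N))
```

For example, with the linear score $J(w) = w$, ranks 2, 3, 7, 9 and $N = 10$,
both routes give the average of 0.2, 0.3, 0.7 and 0.9.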

The asymptotic normality of $S_N^*$ is almost trivial if $J$ is differentiable on (0, 1)
and the derivative is bounded there. In this case, the functional $K(f) =
\int_0^1 J'(w)f(w)\,dw$ is uniformly continuous for $f \in D[0, 1]$, the set of all functions on
[0, 1] which have limits from the left and are continuous from the right. By
Theorem 3.12 of Ruymgaart (1988), $S_N^*$ converges in distribution to $\int_0^1 J'(w)L(w)\,dw$,
which is a Gaussian random variable. Aly et al. (1987) study more general score
functions which are allowed to depend on $N$. The representation for rank
statistics used by Pyke and Shorack (1968) is

$$S_N^* = \int_0^1 L_N(w)\,d\nu_N(w),$$

where $\nu_N$ is a signed measure which puts measure $c_{Ni}$ on the point $i/N$ for
$i = 1, \dots, N$. This representation is somewhat cumbersome and less natural
than that derived for the $CD_N$ process.


2.3.4.  Existing  Work  on  the  Estimation  of  the  Comparison  Density.  In 
this  subsection,  existing  work  on  estimating  and  testing  the  uniformity  of  the 
comparison  density  is  reviewed.  Some  very  interesting  work  has  been  done  in 
the  area,  but  it  is  seen  that  these  techniques  fall  short  of  the  goals  which  have 
been  outlined. 

Parzen (1983) derives a testing and estimation methodology which fits into
the framework outlined in Subsection 2.2.6. His approach is essentially the same
as that which will be taken here: to apply a general method for the estimation of
densities to the special stochastic process $CD_N$. Parzen's estimator is known as
the autoregressive estimator and its use in the general density estimation
setting is detailed in Parzen (1979) and Carmichael (1984). It is also discussed in
Subsection 2.4.5.

The estimate of the comparison density, $\hat d_k$, is defined by

$$\hat d_k(w) = \sigma_k^2 \left| 1 + \sum_{j=1}^{k} a_j e^{2\pi i j w} \right|^{-2},$$

where the $a_j$'s are complex-valued and $|\cdot|^2$ denotes the complex squared
modulus. The parameter, $k$, is a smoothing parameter and is referred to as the order
of the autoregressive process. Larger values of $k$ lead to rougher estimates. The
$a_j$'s and $\sigma_k^2$ are estimated from the data, $R_1/N, \dots, R_m/N$. The form of $\hat d_k(w)$ is
that of the spectral density [see Newton (1988)] associated with a complex-valued
AR($k$) process with coefficients, $a_1, \dots, a_k$, and normalized residual variance,
$\sigma_k^2$, hence the name.

Parzen (1983) defines the pseudo-correlations to be

$$\rho_k = \int_0^1 e^{2\pi i k x}\,dD_N(x), \quad \text{for } k = 0, 1, 2, \dots.$$

The estimates of $a_j$, $j = 1, \dots, k$ and $\sigma_k^2$ are $\hat a_j$, $j = 1, \dots, k$ and $\hat\sigma_k^2$, respectively.
They may be obtained utilizing a complex-valued version of Levinson's algorithm
(see Parzen (1983) for a description of the algorithm) based on the
pseudo-correlations. Parzen suggests choosing $k$ by a version of Akaike's (1974) AIC
criterion, which chooses $k$ to minimize


$$\mathrm{AIC}(k) = \ln \hat\sigma_k^2 + \frac{2k}{N}\,\frac{1 - \lambda_{(N)}}{\lambda_{(N)}}$$


if that value of AIC is negative and selects $k$ to be zero otherwise. The selection
of $k = 0$ is significant because $\hat d(w) = 1$ if $k = 0$. The AIC selection criterion
can then also be viewed as a test of the null hypothesis, $H_0$: $d_{(N)}(w) = 1$. If
AIC chooses $k = 0$, the null hypothesis is accepted. If AIC chooses $k > 0$, the
null hypothesis is rejected and a model is chosen. This simultaneous testing
and model selection is exactly what is being sought here. However, neither the
autoregressive estimator nor AIC will be used in this dissertation. The procedure,
however, stands as a benchmark to which to compare any new procedures.

There are several difficulties with this procedure both as a test of $H_0$ and as
an estimator. As a test, the properties of AIC in this framework are unknown.
In particular, the size of this test (the probability of rejection if $H_0$ is true) is
unknown. Nor are there any provisions for adjusting the size to a pre-specified
level. These two are not damning criticisms. Although one would probably
not expect to solve them analytically, they certainly would yield to simulation
techniques. Of much more concern is the behavior of $\hat d_k$ as an estimator. It can
be shown that $\hat d_k$ satisfies the relation

$$\hat d_k(0) = \hat d_k(1).$$

Such a condition is referred to as a periodicity condition. There is no reason
to suppose that $d_\lambda(u)$ satisfies such a condition. If it does not, the estimator is
biased and inconsistent at the ends. It will be seen in Subsection 2.4 that such
biases can reduce the efficiency of an estimator drastically.
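The periodicity condition is a property of the autoregressive form itself,
whatever the coefficients. The sketch below assumes the estimator has the usual
autoregressive spectral form $\sigma^2 |1 + \sum_j a_j e^{2\pi i j w}|^{-2}$ and that the
pseudo-correlations are averages of $e^{2\pi i k R_j/N}$; the function names and the
coefficient values in the usage note are purely illustrative and are not taken
from Parzen (1983). Levinson's algorithm is not implemented here.

```python
import numpy as np

def pseudo_correlations(ranks, N, k_max):
    # rho_k = int_0^1 exp(2 pi i k x) dD_N(x) = average of exp(2 pi i k R_j / N)
    u = np.asarray(ranks, dtype=float) / N
    k = np.arange(k_max + 1)
    return np.exp(2j * np.pi * np.outer(k, u)).mean(axis=1)

def ar_density(w, coeffs, sigma2):
    # sigma2 * |1 + sum_j a_j exp(2 pi i j w)|^(-2)
    w = np.atleast_1d(np.asarray(w, dtype=float))
    j = np.arange(1, len(coeffs) + 1)
    poly = 1.0 + np.exp(2j * np.pi * np.outer(w, j)) @ np.asarray(coeffs, dtype=complex)
    return sigma2 / np.abs(poly) ** 2
```

Calling `ar_density([0.0, 1.0], [0.3 + 0.1j, -0.2j], 1.0)` returns two equal
values, since every term $e^{2\pi i j w}$ takes the same value at $w = 0$ and $w = 1$.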

Eubank, LaRiccia, and Rosenstein (1987) investigate what they term the
components of Pearson's phi-squared distance measure. Pearson's phi-squared,
as defined by Eubank et al. (1987), is

$$\phi^2 = \int_0^1 \{d_{(N)}(w) - 1\}^2\,dw,$$

which is the squared $L_2[0, 1]$ distance between $d_{(N)}$ and 1. Certainly $H_0$: $d_{(N)}(w) =
1$, $0 \le w \le 1$ is equivalent to $H_0$: $\phi^2 = 0$. Unfortunately, there are no natural
estimators for $\phi^2$, although an estimate of the form

$$\int_0^1 (\hat d(w) - 1)^2\,dw$$

is investigated in Section 3. Eubank et al. suggest instead decomposing $\phi^2$ into
its components. Start by selecting a complete orthonormal sequence for $L_2[0, 1]$
(see Subsection 2.4.4), $\{p_j(w)\}$, and define

$$a_j = \int_0^1 [d_{(N)}(w) - 1]\,p_j(w)\,dw \quad \text{for } j = 1, 2, \dots.$$

The $a_j$'s are the components of Pearson's phi-squared and they satisfy

$$\phi^2 = \sum_{j=1}^{\infty} a_j^2.$$

The components are estimated by

(2.3.7)  $\hat a_j = \int_0^1 p_j(w)\,dCD_N(w)$.

These components bear a marked similarity to the components of the
Anderson-Darling and Cramér-von Mises statistics discussed in Subsection 2.2.5. In point of
fact, choosing $p_j(w) = \sin \pi j w$ yields the components of the Cramér-von Mises
statistic and $p_j(w)$ as the Legendre polynomials yields the components of the
Anderson-Darling statistic.

A test of $H_0$: $\phi^2 = 0$ is then equivalent to $H_0$: $a_j = 0$ for $j \ge 1$. Of course,
this latter hypothesis is not testable. One cannot simultaneously test an infinite
number of parameters. One could weight the components by forming a statistic
like $\sum_j \lambda_j \hat a_j^2$ where $\sum_j \lambda_j^2 \operatorname{Var}(\hat a_j) < \infty$ to arrive at an asymptotically consistent
test. Recall this is the form of the Anderson-Darling and the Cramér-von Mises
statistics. Eubank et al. suggest instead that one test subhypotheses, such as

(2.3.8)  $H_0$: $a_j = 0$ for $j = 1, \dots, M$.




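This latter suggestion is most intriguing as one then gives equal weight in the
testing procedure to each of the first $M$ $a_j$'s. This notion will be discussed at
greater length in Section 3. In terms of implementing this suggestion, note that
under $H_0$

$$\begin{aligned}
\operatorname{Cov}(\hat a_j, \hat a_k)
&= \int_0^1\!\!\int_0^1 p_j'(u)\,K(u,v)\,p_k'(v)\,du\,dv \\
&= \frac{1 - \lambda_{(N)}}{\lambda_{(N)}} \left[ \int_0^1 p_j(w)p_k(w)\,dw - \int_0^1 p_j(w)\,dw \int_0^1 p_k(w)\,dw \right] \\
&= \frac{1 - \lambda_{(N)}}{\lambda_{(N)}} \left[ \delta_{jk} - \int_0^1 p_j(w)\,dw \int_0^1 p_k(w)\,dw \right],
\end{aligned}$$

where $\delta_{jk}$ equals 1 if $j = k$ and 0 otherwise.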


In particular, if the orthonormal sequence, $\{p_j(w)\}$, is also orthogonal to
$p_0(w) = 1$ then the components are asymptotically Gaussian and independent
with variance

$$\frac{1 - \lambda_{(N)}}{\lambda_{(N)}}.$$
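The stated variance can be checked by simulation. The sketch below is an
illustration under stated assumptions: under $H_0$ the first-sample ranks are a
simple random sample of size $m$ from $1, \dots, N$; the basis is taken to be
$p_j(w) = \sqrt{2}\cos(j\pi w)$, which is orthonormal and orthogonal to 1; and
since $D_{(N)}(w) = w$ under $H_0$, the component reduces to $\sqrt{N}$ times the
average of $p_j(R_i/N)$. The function names are hypothetical.

```python
import numpy as np

def component(ranks, N, j):
    # sqrt(N) * average of p_j(R_i / N) with p_j(w) = sqrt(2) cos(j pi w)
    u = np.asarray(ranks, dtype=float) / N
    return np.sqrt(N) * np.mean(np.sqrt(2.0) * np.cos(j * np.pi * u))

def simulate_null_variance(m, n, j, reps=2000, seed=0):
    rng = np.random.default_rng(seed)
    N = m + n
    values = np.empty(reps)
    for r in range(reps):
        # under H0 the first-sample ranks are a simple random sample from 1..N
        ranks = rng.choice(N, size=m, replace=False) + 1
        values[r] = component(ranks, N, j)
    return values.var()
```

With $m = 30$ and $n = 70$ the simulated variance should be close to
$(1 - 0.3)/0.3 \approx 2.33$, up to Monte Carlo error.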


From the form of (2.3.7), it is seen that the components are also rank
statistics. This interpretation of the components brings to light several interesting
prospects. First, Eubank et al. (1987) note that the usual form of the sequence
of $p_j(w)$'s is that they become more oscillatory as $j$ increases. Typically, $p_1$
will have one zero crossing in (0, 1), $p_2$ will have two, $p_3$ three, and so on. A
score function with one zero crossing is testing location; one with two crossings
tests scale. More crossings can be viewed as testing higher frequency departures
from uniformity. Testing the hypothesis (2.3.8) for the first 2 components then
results in a test against both location and scale. The independence of location
and scale rank statistics is known in the literature [see Randles and Hogg (1971)
and Boos (1986)], but it is presented in an ad hoc fashion with no unifying
philosophy. Eubank et al. seem to be the first to give any framework and extension
to this observation. One might choose the parameter, $M$, based on how great a
departure is deemed worthy of testing.

Second, one can choose the orthonormal sequence so that it protects best
against certain distributions or types of tail behavior. For example, if one wished
to design a test to be most powerful in the case when $F$ and $G$ are long tailed,
one might choose to use the cosine functions for the sequence. The cosine is the
optimal score function of the Cauchy. Similarly, the Legendre polynomials would
be suitable for medium tailed distributions, such as the logistic distribution.

Although they do not investigate the properties at all, Eubank et al. (1987)
note that the components do lead to an estimate of $d_{(N)}(w)$ as, say,

$$\hat d(w) = 1 + \sum_{j=1}^{M} \hat a_j\,p_j(w).$$

Such estimates are referred to as orthogonal series estimates and will be discussed
in Subsection 2.4.4. Eubank et al. present some very intriguing ideas, several
of which will be taken up in later sections. However, they do not outline the
comprehensive testing and estimation procedure that is being sought here.

2.4.  Review  of  Density  Estimation  Techniques  on  [0,1] 

2.4.1. Introduction. The purpose of this subsection is to review various
methods that can be used to estimate $d_{(N)}(u)$. These techniques fall under the
general heading of density estimation. Since $d_{(N)}(u)$ is known to have support
[0, 1], the implications of this fact on the properties of the estimators must be
closely examined. Modifications of certain estimators for this case, particularly
kernel estimators, have been proposed in the literature. These, too, will be
reviewed.

As with two sample tests, density estimators can also be classified as
parametric or nonparametric. A parametric estimator of a density assumes that
$f \in \mathcal{F}_\Theta$ where the family of distributions, $\mathcal{F}_\Theta$, is defined as $\mathcal{F}_\Theta = \{f(x) =
f(x; \theta),\ \theta \in \Theta,\ \Theta \subset \mathbb{R}^k\}$. Given a random sample, $X_1, \dots, X_n$, from $f(\cdot\,; \theta)$,
$\theta$ is estimated by a method such as maximum likelihood, minimum chi-square,
or the method of moments [see Kendall and Stuart (1979)]. The resultant
estimator of $f$ is $\hat f = f(\cdot\,; \hat\theta)$. In particular, if $\hat\theta$ is the maximum likelihood estimator of $\theta$
then $f(\cdot\,; \hat\theta)$ is the maximum likelihood estimator of $f$ by the invariance principle
[see Mood, Graybill, and Boes (1974), page 284].

Philosophically, should there be justification to assume that the underlying
density belongs to a parametric family it is, as Good and Gaskins (1971) remark,
"a pity to waste it." It will be seen that parametric estimators generally enjoy
faster rates of convergence of mean squared error (MSE) than nonparametric
ones. In the case at hand, namely estimation of $d_\lambda(u)$ from $R_1/N, \dots, R_m/N$,
parametric techniques simply do not apply. There is no reason to suspect that
the class $\mathcal{P}$ of all $d_\lambda(u)$ constructed as in Subsection 2.3 can be indexed by a
finite dimensional parameter, $\theta$. For this reason, it is necessary to look into the
realm of nonparametric techniques of density estimation for a methodology.

Several nonparametric density estimators will be examined in detail. These
are the histogram, the kernel method, the orthogonal series method and
AR/ARMA methods. In this discussion, it is assumed that one desires to
estimate a density, $f$, and has at hand a random sample, $X_1, \dots, X_n$, from $f$.
The discussion will be organized as follows. First the estimator is defined and
its properties given. Special attention will be given to representations of mean
squared error and mean integrated squared error (MISE). Other properties such
as weak and strong consistency will be referenced but not detailed. Second, the
implications of $f$ having support [0, 1] are examined. If modifications to the
original estimator have been proposed, these will be discussed.

It is assumed that the reader is familiar with the basic concepts of
nonparametric density estimation and related standard terms such as MISE.
Background material can be found in the following books and review articles:
Silverman (1986), Titterington (1985), Bean and Tsokos (1980), Tapia and Thompson
(1978), Wertz (1978), Scott, Tapia, and Thompson (1977), Wegman (1972a) and
Wegman (1972b).

At this point it seems wise to reiterate that the properties of the estimators
described in the following subsections are derived for data, $X_1, \dots, X_n$, which are
iid $f$. Since this is not the case for $R_1/N, \dots, R_m/N$, it should not be expected
that the properties will carry over in a one to one fashion. Since the iid case
is in many ways ideal, such results may indicate the best that can be done.

2.4.2. The Histogram. The histogram is the oldest of the nonparametric
density estimators and is best suited to $f$ having compact support. In fact,
difficulties arise should $f$ have infinite support. Despite this, it is seen that the
histogram is not best suited to the needs at hand.

The histogram is constructed for data, $x_1, \dots, x_n$, in the following manner.
Select bin edges $t = (t_0, \dots, t_m)$ such that $0 = t_0 < t_1 < \cdots < t_m = 1$. The
histogram estimate is given by

$$f^h(x; t) = \frac{n_i}{n(t_i - t_{i-1})}, \quad \text{for } t_{i-1} \le x < t_i,$$

where $n_i$ is the number of data points falling in the interval $[t_{i-1}, t_i)$.
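The definition translates directly into a few lines of code. The following is a
minimal sketch using NumPy; `histogram_density` is a hypothetical helper name.
Note one deviation from the definition above: `numpy.histogram` treats the last
bin as closed on the right, so a data point equal to $t_m$ is counted in the
final bin.

```python
import numpy as np

def histogram_density(data, edges):
    data = np.asarray(data, dtype=float)
    edges = np.asarray(edges, dtype=float)
    counts, _ = np.histogram(data, bins=edges)       # n_i for each bin
    heights = counts / (len(data) * np.diff(edges))  # n_i / (n (t_i - t_{i-1}))

    def f_h(x):
        # locate the bin containing x and return its height
        x = np.asarray(x, dtype=float)
        i = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, len(heights) - 1)
        return heights[i]

    return f_h, heights
```

The bin heights are non-negative and, when all data lie in [0, 1], the heights
times the bin widths sum to one.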

Two of the simplest properties of $f^h(x; t)$ are easily verified; namely $f^h(x; t)
\ge 0$, $\forall x$ and $\int f^h(x; t)\,dx = 1$. These imply the estimate is itself a
probability density function. Tapia and Thompson (1978) prove the following theorem
concerning the consistency in mean square of the histogram.

Theorem 2.4.1 (Tapia and Thompson, 1978). Suppose that $f$ has continuous
derivatives up to order three except at the endpoints of [0, 1], and $f$ is bounded
on [0, 1]. Let the mesh be equal spacing throughout [0, 1], so that $t_i - t_{i-1} = 2h$.
If $h \to 0$ and $nh \to \infty$ as $n \to \infty$ (note the partition is now a function of $n$),
then for $x \in (0, 1)$, $\mathrm{MSE}_x(f^h, f) \to 0$, where

$$\mathrm{MSE}_x(f^h, f) = E\left([f^h(x; t) - f(x)]^2\right).$$

From the details of the proof of Theorem 2.4.1, an upper bound on MISE can be
derived. This bound is

$$\mathrm{MISE}(f^h, f) \le \frac{1}{2nh} + 2h^2 \int_0^1 f'(x)^2\,dx + O(n^{-1}) + O(h^3).$$

Minimizing this bound with respect to the bin width, $h$, one finds that the best
rate of convergence of $\mathrm{MISE}(f^h, f)$ is $O(n^{-2/3})$.

There are several criticisms of the histogram. The first is the rate of
convergence of MISE. The kernel estimators examined in the next subsection do better
than $O(n^{-2/3})$. Second, it seems unfortunate to estimate a function which has
been assumed to possess three continuous derivatives by a step function. If $f$ is
smooth, it is desirable that the estimate should be, too.

2.4.3. The Kernel Method. This subsection reviews kernel density
estimators. With some modification for boundary effects, it is seen that kernels are a
viable method of density estimation on [0, 1]. Rosenblatt (1956) first suggests the
method of kernel density estimation. He examines in detail only the rectangular
kernel, however. Parzen (1962a) investigates general kernels and derives myriad
results. In fact, such has been the influence of his research in this area that they
are often referred to as Parzen kernel estimators.

The kernel density estimator, $f^k$, is defined as

(2.4.1)  $f^k(x; h) = \frac{1}{h} \int K\!\left(\frac{x - y}{h}\right) dF_n(y)$,

where $F_n$ is the empirical distribution function and $h$ is the bandwidth or window
width. Parzen establishes the following theorem concerning the mean square
consistency of $f^k$.


Theorem 2.4.2 (Parzen, 1962a). Let $x$ be a continuity point of $f$ and suppose
$h \to 0$ and $nh \to \infty$ as $n \to \infty$. Assume that $K$ is bounded, absolutely integrable,
and $|yK(y)| \to 0$ as $|y| \to \infty$; then $\mathrm{MSE}_x(f^k, f) \to 0$ as $n \to \infty$.


Under various additional assumptions on $h$, $K$ and $f$, he also establishes
asymptotic normality and uniform weak consistency.

Strong consistency has been considered by several authors: Silverman (1978),
Bertrand-Retali (1978) and Nadaraya (1965). Wahba (1975) derives minimax
results for the MSE of kernel estimators. That is, for suitable restrictions on $K$
and $h$ and $f \in W$, where $W$ is an appropriate space of densities, Wahba derives
an upper bound on $\mathrm{MSE}_x(f^k, f)$ for all $f \in W$.

Parzen (1962a) gives an asymptotic representation of MSE and MISE of $f^k$.
He assumes the existence of an integer $r > 0$ such that

$$k_r = \lim_{u \to 0} \frac{1 - k(u)}{|u|^r}$$

is finite and nonzero where $k(u)$ is the Fourier transform of $K$. The number $k_r$ is
called the characteristic coefficient and $r$ the characteristic exponent of the kernel
$K$. Parzen (1962a) assumes also that $f^{(r)}(x) = \frac{1}{2\pi}\int e^{-iux}|u|^r\varphi(u)\,du$ converges
absolutely, where $f^{(r)}$ is the $r$th derivative of $f$ and $\varphi$ is the characteristic function
of $f$. In this case, MISE and MSE admit the following expansions:

(2.4.2)  $\mathrm{MSE}_x(f^k, f) \sim \frac{f(x)}{nh} \int K(y)^2\,dy + h^{2r} k_r^2\,|f^{(r)}(x)|^2$,

(2.4.3)  $\mathrm{MISE}(f^k, f) \sim \frac{1}{nh} \int K(y)^2\,dy + h^{2r} k_r^2 \int |f^{(r)}(x)|^2\,dx$.

In each of these expansions, the first term is the contribution due to variance
and the second due to squared bias. Minimizing (2.4.3) with respect to $h$, one
sees that the best rate of convergence of MISE is $O(n^{-2r/(2r+1)})$. $K$ is normally
chosen to be a probability density function which is symmetric about 0 and has
a finite variance (which implies $r = 2$), in which case the best rate of convergence
is $O(n^{-4/5})$. This rate is better than that of the histogram.
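The rate follows from balancing the two terms of an expansion of the form
$A/(nh) + Bh^{2r}$. The sketch below works this out, writing $A$ for the variance
constant and $B$ for the squared-bias constant; these two symbols and the
function names are introduced here for illustration only.

```python
import numpy as np

def amise(h, n, r, A, B):
    # A / (n h) is the variance term, B h^(2r) the squared-bias term
    return A / (n * h) + B * h ** (2 * r)

def optimal_bandwidth(n, r, A, B):
    # root of d/dh amise = -A / (n h^2) + 2 r B h^(2r - 1)
    return (A / (2 * r * B * n)) ** (1.0 / (2 * r + 1))
```

The minimizing bandwidth is proportional to $n^{-1/(2r+1)}$, and substituting it
back gives a minimum proportional to $n^{-2r/(2r+1)}$. For $r = 2$, multiplying $n$
by 32 therefore divides the minimum by $32^{4/5} = 16$.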

The discussion will now center on events at $x = 0$ (however, any conclusions
hold for the other endpoint, $x = 1$). Suppose that it is known that $f$ has support
[0, 1] and is continuous on (0, 1). If $f(0) = f(1) = 0$, so that $f$ is continuous
on $\mathbb{R}$, then all the standard results noted above still apply. Now suppose that
$f(0) > 0$; $f$ has a simple discontinuity at $x = 0$. Theorem 2.4.2 now fails at
$x = 0$.

Let $K$ be a symmetric density function and consider $f^k(0; h)$. Equation
(2.4.1) simplifies to

$$f^k(0; h) = \frac{1}{h} \int_0^1 K(-y/h)\,dF_n(y),$$

which means that one is using $K(y)$ only for $y \le 0$. In this case, one can define the
effective kernel to be $K_e(y) = K(y)\,I_{(-\infty, 0]}(y)$ where $I$ is the indicator function.

For $K_e$ the characteristic exponent is $r = 0$ since $\int K_e(y)\,dy = \frac{1}{2}$. Referring
to equation (2.4.2), one sees that $\mathrm{MSE}_0(f^k, f) \to \frac{1}{4} f(0)^2$ as $h \to 0$, $nh \to \infty$
and $n \to \infty$, or that $\mathrm{Bias}_0(f^k, f) \to -\frac{1}{2} f(0)$. The problem is that $f^k(0; h)$ is
converging in mean square to $\frac{1}{2} f(0)$ [see Schuster (1985)]. There is no difficulty
with the variance term, only the bias term.
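The halving at the boundary is easy to reproduce. The sketch below is an
illustration only: it uses the biweight kernel, a bandwidth of 0.15, and a
uniform sample that is much larger than would be used in practice so that the
limiting value is visible; `kernel_density` is a hypothetical helper name.

```python
import numpy as np

def biweight(t):
    # K(t) = (15/16) (1 - t^2)^2 on [-1, 1], zero elsewhere
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= 1.0, 15.0 / 16.0 * (1.0 - t ** 2) ** 2, 0.0)

def kernel_density(x, data, h, kernel=biweight):
    # f(x; h) = (1 / (n h)) * sum_i K((x - X_i) / h)
    x = np.atleast_1d(np.asarray(x, dtype=float))
    data = np.asarray(data, dtype=float)
    return kernel((x[:, None] - data[None, :]) / h).mean(axis=1) / h

rng = np.random.default_rng(0)
sample = rng.random(100_000)                      # iid uniform [0, 1]
estimates = kernel_density([0.0, 0.5, 1.0], sample, h=0.15)
```

The estimate at 0.5 is close to the true value 1, while the estimates at 0 and
at 1 are close to 1/2, in line with the bias of $-\frac{1}{2}f(0)$ described above.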

To investigate this phenomenon more closely, start by assuming that $K$ has
compact support. For $x \ge h$, there is no problem and the usual definition (2.4.1)
applies. For $x < h$, part of the kernel is clipped; it will never be used since $f$ is
zero on $(-\infty, 0)$. So $f^k(x; h)$ can be rewritten as

$$f^k(x; h) = \frac{1}{h} \int_0^1 K^e_s\!\left(\frac{x - y}{h}\right) dF_n(y),$$

where $s = x/h$ and

$$K^e_s(y) = \begin{cases}
K(y)\,I_{[-1, s]}(y), & \text{for } 0 \le s < 1; \\
K(y)\,I_{[-1, 1]}(y), & \text{for } s \ge 1,
\end{cases}$$

is the effective kernel. Figure 3 depicts this phenomenon graphically. For
estimating $f$ at $x_0$, one uses a kernel "sitting" on the point $x_0$. A portion of the right
tail of $K$, which is the mirror image of that portion of the left tail falling below
0, is never used. So, instead of using a kernel with $r = 2$, one is actually using a
kernel with $r = 0$. The bias is greatly increased. However, for points away from
$x = 0$, the MSE behaves as usual asymptotically. This is so because, for fixed
$x_0$, $s = x_0/h = 1$ when $h = x_0$. Once $h$ has decreased to $x_0$, the kernel sitting at $x_0$
no longer reaches to $x < 0$ and so the effective kernel, $K_e$, is just $K$. Since $h$ is
tending to 0, eventually $h \le x_0$ for every $x_0 > 0$ and so the usual asymptotics
apply to all $x > 0$.

Fig. 3. Position of the kernel to estimate the density at $x_0$. Region (a) on the
right corresponds to region (b) on the left. Region (a) is unused in the estimation.

From this discussion, it should be plain why only kernels of compact support
are considered. Kernels of infinite support will always cross over to $x < 0$ and so
will always be clipped. Hence, no matter how small $h$ is taken, boundary effects
will be experienced for each $x \in (0, 1]$. Gasser and Müller (1979) make note of
this fact.

Fig. 4. The kernel density estimate for a random sample of size 250 from the
uniform [0, 1] distribution. The estimate is constructed using the biweight kernel
and $h = 0.15$.

In finite samples, one should expect to see boundary effects for $x$ between 0
and $h$. Figure 4 demonstrates this point. The figure presents the kernel density
estimate based on 250 iid observations taken from the uniform [0, 1] distribution.
The biweight kernel, $K(t) = \frac{15}{16}(1 - t^2)^2 I_{[-1, 1]}(t)$, and bandwidth, $h = 0.15$,
are used. Notice how the estimate starts to bend downward for $x < 0.15$ and
$x > 0.85$. Clearly, the situation is not satisfactory.

Several proposals have been made to correct the situation. The cut and
normalize method normalizes the effective kernel, $K^e_s$, so that it integrates to
one. Define the cut and normalize kernel $K^{cn}_s(t)$ as

$$K^{cn}_s(t) = K^e_s(t) \Big/ \int K^e_s(y)\,dy.$$

This normalization moves the characteristic exponent from $r = 0$ to $r = 1$ for
$0 \le s < 1$. In the interior, one is using a kernel with characteristic exponent
$r = 2$ and at the boundaries a kernel with characteristic exponent $r = 1$. To
examine the MISE of this estimator, define $\mu(s) = \int_{-1}^{M(s)} K^{cn}_s(t)^2\,dt$ and $\nu(s) =
\left(\int_{-1}^{M(s)} t\,K^{cn}_s(t)\,dt\right)^2$, where $M(s) = \min(s, 1)$. The MSE of the cut and normalize
estimator, $f^{cn}(x; h)$, is

$$\mathrm{MSE}_x(f^{cn}, f) = \frac{1}{nh}\,\mu(x/h) f(x) + h^2 \nu(x/h) f'(x)^2$$

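and the MISE is

$$\begin{aligned}
\mathrm{MISE}(f^{cn}, f)
&= \frac{1}{nh} \int_0^1 \mu(x/h) f(x)\,dx + h^2 \int_0^1 \nu(x/h) f'(x)^2\,dx \\
&= \frac{1}{nh} \int_{-1}^{1} K(t)^2\,dt \int_h^1 f(x)\,dx + \frac{1}{nh} \int_0^h \mu(x/h) f(x)\,dx
   + h^2 \int_0^h \nu(x/h) f'(x)^2\,dx \\
&= \frac{1}{nh} \int_{-1}^{1} K(t)^2\,dt + O(n^{-1}) + \frac{1}{n} \int_0^1 \mu(y) f(hy)\,dy
   + h^3 \int_0^1 \nu(y) f'(hy)^2\,dy \\
&= \frac{1}{nh} \int_{-1}^{1} K(t)^2\,dt + \frac{1}{n} \int_0^1 \mu(y)\,(f(0) + R_1(hy))\,dy \\
&\qquad + h^3 \int_0^1 \nu(y)\,(f'(0)^2 + R_2(hy))\,dy + O(n^{-1}) \\
&= \frac{1}{nh} \int_{-1}^{1} K(t)^2\,dt + h^3 f'(0)^2 \int_0^1 \nu(y)\,dy + O(n^{-1}) + o(h^3),
\end{aligned}$$

where $R_1(hy) = o(1)$ and $R_2(hy) = o(1)$. Note that, $\forall \epsilon > 0$, $|R_1(hy)| < \epsilon$ if
$|hy| < \delta$ which will occur if $|h| < \delta$ since $0 \le y \le 1$. Thus $\left|\int_0^1 \mu(y) R_1(hy)\,dy\right|
\le \int_0^1 |\mu(y) R_1(hy)|\,dy \le S \int_0^1 |R_1(hy)|\,dy \le S \int_0^1 \epsilon\,dy \le S\epsilon$ if $|h| < \delta$, where $S =
\sup_{0 \le t \le 1} |\mu(t)|$. Thus $\int_0^1 \mu(y) R_1(hy)\,dy = o(1)$ as $h \to 0$. The best rate of
convergence of MISE is $O(n^{-3/4})$, not $O(n^{-4/5})$ as normally expected for a
kernel of characteristic exponent $r = 2$. The poor behavior of MSE at the ends
dominates the entire MISE calculation. Such results call the use of the cut
and normalize kernel into question. It should be noted that this result is in
contradiction to the statement of Gasser and Müller (1979): "...end effects may
dominate the global asymptotic behavior (for nonparametric regression). Note
that this problem does not arise for kernel estimation of densities."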
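The drop to characteristic exponent $r = 1$ at the boundary can be seen from the
first moment of the cut and normalize kernel. The sketch below is a numerical
illustration assuming the biweight kernel on $[-1, 1]$ and a simple midpoint
rule; `cut_and_normalize_moments` is a hypothetical helper name.

```python
import numpy as np

def biweight(t):
    # K(t) = (15/16) (1 - t^2)^2 on [-1, 1], zero elsewhere
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= 1.0, 15.0 / 16.0 * (1.0 - t ** 2) ** 2, 0.0)

def cut_and_normalize_moments(s, grid=200_000):
    # effective kernel K(t) on [-1, min(s, 1)], rescaled to integrate to one
    upper = min(s, 1.0)
    dt = (upper + 1.0) / grid
    t = -1.0 + (np.arange(grid) + 0.5) * dt      # midpoints of the grid cells
    k_cn = biweight(t)
    k_cn = k_cn / (k_cn.sum() * dt)
    mass = k_cn.sum() * dt
    first_moment = (t * k_cn).sum() * dt
    return mass, first_moment
```

At $s = 0$ the kernel integrates to one but its first moment is $-0.3125$, so the
leading bias term is of order $h$ rather than $h^2$. At $s = 1$ the kernel is the
full symmetric biweight and the first moment vanishes.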

Another method of dealing with the boundary effect is the method of reflection,
which is detailed by Schuster (1985). He defines new random variables,
$Y_i = S_i X_i$, where $P[S_i = 1] = P[S_i = -1] = \tfrac{1}{2}$ and the $S_i$'s are independent
of the $X_i$'s. The density function of $Y$ is $f_Y(y) = f(|y|)/2$, which is continuous
at $y = 0$ and so kernel methods should be satisfactory. If $K$ is symmetric, the
estimate of $f$ does not depend on the $S_i$'s and is

$$f^{r}(x;h) = \frac{1}{nh}\sum_{i=1}^{n}\left[K\!\left(\frac{x - X_i}{h}\right) + K\!\left(\frac{x + X_i}{h}\right)\right].$$

In terms of the usual kernel density estimation formula (2.4.1), the kernel is found
to be

$$K^{r}_{s}(t) = \begin{cases} [K(t) + K(2s - t)]\,I_{[-1,s]}(t), & \text{for } 0 \le s < 1; \\ K(t), & \text{for } s \ge 1, \end{cases}$$

where  s  =  x/h  as  before.  Referring  to  Figure  3,  this  method  amounts  to  “folding 
over”  the  unused  portion  of  the  kernel  in  region  (a)  back  into  the  region  where 
the  data  lies. 

Given a standard kernel representation for $f^{r}(x;h)$, the MSE and MISE
can be examined. It can be verified that $\int K^{r}_{s}(t)\,dt = 1$ for all $s$ but that
$\int t K^{r}_{s}(t)\,dt \ne 0$ for $0 < s < 1$. Again, one expects to observe the degraded
MISE characteristics the cut and normalize method experiences. Schuster does
point out, however, that $f^{r}$ is non-negative, integrates to 1 and is asymptotically
normal.
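
A minimal Python sketch of the reflection idea follows. It is an illustration under stated assumptions, not Schuster's own code: the biweight kernel is assumed, the data are simulated from an exponential density (chosen because it is largest at the boundary), and the function name `reflection_kde` is hypothetical. Each observation contributes once at its own location and once at its mirror image about zero.

```python
import numpy as np

def biweight(t):
    """Biweight kernel K(t) = (15/16)(1 - t^2)^2 on [-1, 1], zero elsewhere."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= 1.0, 15.0 / 16.0 * (1.0 - t**2) ** 2, 0.0)

def reflection_kde(x, data, h):
    """Reflection estimate on [0, inf): each X_i is paired with its mirror image -X_i."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    data = np.asarray(data, dtype=float)
    direct = biweight((x[:, None] - data[None, :]) / h)
    mirrored = biweight((x[:, None] + data[None, :]) / h)
    est = (direct + mirrored).sum(axis=1) / (data.size * h)
    return np.where(x >= 0.0, est, 0.0)

rng = np.random.default_rng(0)
sample = rng.exponential(scale=1.0, size=200)     # simulated data, assumption for illustration
grid = np.linspace(0.0, sample.max() + 0.5, 4001)
density = reflection_kde(grid, sample, h=0.5)
```

The estimate built this way is non-negative and integrates to one over the half line, in line with the properties Schuster points out; it does not remove the first-order boundary bias.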

Both  the  cut  and  normalize  kernel  and  the  reflection  kernel  are  boundary 
kernels. That is, they are kernels which change their shape when estimating $f$
near  the  boundary.  One  wonders  whether  it  is  possible  to  define  a  boundary 
kernel  in  such  a  way  that  the  normal  MSE  and  MISE  properties  are  preserved. 
At  least  two  authors  have  investigated  this  possibility.  Rice  (1984)  states  the 
problem  in  terms  of  nonparametric  regression  but  notes  that  the  results  translate 
directly  to  the  density  estimation  problem.  In  this  discussion,  his  results  shall 
be  given  in  terms  of  density  estimation.  Since  the  cause  of  difficulty  is  the  bias 
term,  Rice  uses  a  jackknife  approach  to  reduce  the  bias.  This  is  similar  in  spirit 
to  the  approach  Schucany  and  Sommers  (1977)  take  to  reduce  the  bias  of  kernel 
estimators in the non-boundary case. If $f$ admits a second order Taylor's series
expansion, $E[f^{k}(x;h)]$ can be represented as

$$E[f^{k}(x;h)] = \int_{-1}^{x/h} K(y) f(x - hy)\,dy.$$

Now suppose $0 < x < h$, so that

$$E[f^{k}(x;h)] = \int_{-1}^{s} K(y)\left[f(x) - f'(x)hy + \tfrac{1}{2} f''(x) h^2 y^2 + o(h^2 y^2)\right] dy,$$

where $s = x/h$, thus

$$E[f^{k}(x;h)] = f(x) w_0(s) - h f'(x) w_1(s) + \tfrac{1}{2} h^2 f''(x) w_2(s) + o(h^2), \tag{2.4.4}$$

where $w_i(s) = \int_{-1}^{s} t^i K(t)\,dt$ for $i = 0, 1, 2$. Rice (1984) suggests defining a new
estimator by

$$f^{j}(x;h,\alpha,\beta,a) = \frac{1}{a}\left(f^{k}(x;h) - \beta\left[f^{k}(x;h) - f^{k}(x;\alpha h)\right]\right).$$

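Using (2.4.4), one finds the expected value of $f^{j}$ to be:

$$\begin{aligned}
E[f^{j}(x;h,\alpha,\beta,a)] = \frac{1}{a}\Big( &[(1 - \beta) w_0(s) + \beta w_0(s/\alpha)]\, f(x) \\
&- [(1 - \beta) w_1(s) + \alpha\beta w_1(s/\alpha)]\, h f'(x) \\
&+ \tfrac{1}{2}[(1 - \beta) w_2(s) + \alpha^2 \beta w_2(s/\alpha)]\, h^2 f''(x) \Big) + o(h^2).
\end{aligned}$$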

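The parameters $a$, $\beta$, and $\alpha$ need to be chosen so that this expectation is $f(x) +
\text{const.}\,h^2$. This is accomplished by setting

$$a = (1 - \beta) w_0(s) + \beta w_0(s/\alpha), \qquad \beta = \frac{w_1(s)}{w_1(s) - \alpha w_1(s/\alpha)}.$$

The parameter $\alpha$ is still free; Rice suggests setting $\alpha = 2 - s$ so that one always
smooths over an interval of length $2h$. The kernel which defines $f^{j}$ is

$$K^{j}_{s}(t) = \begin{cases} \frac{1}{a}\left[(1 - \beta) K^{0}(t) + \beta K^{1}(t)\right], & \text{for } 0 \le s < 1; \\ K(t), & \text{for } s \ge 1, \end{cases}$$

where

$$K^{0}(t) = K(t)\, I_{[-1,s]}(t) \quad \text{and} \quad K^{1}(t) = \frac{1}{\alpha} K(t/\alpha)\, I_{[-\alpha,s]}(t).$$

It isn't hard to show that $K^{j}_{s}$ satisfies $\int K^{j}_{s}(t)\,dt = 1$ and $\int t K^{j}_{s}(t)\,dt = 0$ for
all $s$.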
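
The moment conditions can be verified numerically. The Python sketch below is a hedged illustration: it assumes the biweight kernel, takes alpha = 2 - s as Rice suggests, chooses beta to cancel the first-moment term and a to restore unit mass, and writes the rescaled kernel as K(t/alpha)/alpha on [-alpha, s]. The function names (`moment`, `rice_kernel`) are hypothetical and the integrals are approximated by a trapezoid rule.

```python
import numpy as np

def biweight(t):
    """Biweight kernel K(t) = (15/16)(1 - t^2)^2 on [-1, 1], zero elsewhere."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= 1.0, 15.0 / 16.0 * (1.0 - t**2) ** 2, 0.0)

def trapezoid(y, x):
    """Composite trapezoid rule."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

def moment(i, s, num=20001):
    """w_i(s): integral of t^i K(t) over [-1, min(s, 1)]."""
    t = np.linspace(-1.0, min(s, 1.0), num)
    return trapezoid(t**i * biweight(t), t)

def rice_kernel(t, s):
    """Jackknife boundary kernel at s = x/h with alpha = 2 - s (assumed form)."""
    t = np.asarray(t, dtype=float)
    if s >= 1.0:
        return biweight(t)
    alpha = 2.0 - s
    beta = moment(1, s) / (moment(1, s) - alpha * moment(1, s / alpha))
    a = (1.0 - beta) * moment(0, s) + beta * moment(0, s / alpha)
    k0 = np.where(t <= s, biweight(t), 0.0)                  # K(t) on [-1, s]
    k1 = np.where(t <= s, biweight(t / alpha) / alpha, 0.0)  # rescaled K on [-alpha, s]
    return ((1.0 - beta) * k0 + beta * k1) / a
```

For s = 0 this construction gives beta = -1, so the kernel is negative on (-2, -1), which illustrates the loss of non-negativity.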

Figure 5 presents $K^{j}_{s}$ for s = 0, 0.25, 0.5, and 0.75 for $K(t)$ equal to the
biweight kernel. Notice that these kernels do eventually have negative regions as


Fig.  5.  Rice’s  boundary  kernels  for  s=0,  0.25,  0.5,  and  0.75.  The  solid  line  is 
s  =  0;  the  broken  line,  s  =  0.25;  the  broken  line  with  points  interspersed,  s  =  .5; 
and  the  solid  line  with  nodes,  s  =  .75. 


$s$ decreases. This is the price that must be paid to keep the first moment equal
to zero throughout. The usual MSE and MISE results will apply to this kernel.
It appears that one has the choice of non-negative estimates with bias of order
$O(h)$ or potentially negative estimates with bias of order $O(h^2)$.

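From (2.4.4) it can be seen that the essential conditions for bias reduction
are

$$\text{i.}\ \int_{-1}^{s} K_s(t)\,dt = 1, \qquad \text{ii.}\ \int_{-1}^{s} t K_s(t)\,dt = 0, \tag{2.4.5}$$

for all $0 \le s < 1$. Approaching the problem from this standpoint, Gasser and
Müller (1979) propose a boundary kernel of the form

$$K^{gm}_{s}(t) = \begin{cases} (\theta_s + \phi_s t)\, K(t)\, I_{[-1,s]}(t), & \text{for } 0 \le s < 1; \\ K(t), & \text{for } s \ge 1, \end{cases}$$

where $\theta_s$ and $\phi_s$ are chosen to be continuous functions of $s$ such that constraints
(2.4.5) hold for all $s$ and $\theta_1 = 1$ and $\phi_1 = 0$. Using the biweight kernel,
$K(t) = \tfrac{15}{16}(1 - t^2)^2$, and substituting the form of $K^{gm}_{s}$ into the constraints (2.4.5),
$\theta_s$ and $\phi_s$ are the solutions to the following linear equations:

$$\begin{aligned}
\tfrac{15}{16}\left[\tfrac{1}{5}(s^5 + 1) - \tfrac{2}{3}(s^3 + 1) + s + 1\right]\theta_s + \tfrac{15}{16}\left[\tfrac{1}{6}(s^6 - 1) - \tfrac{1}{2}(s^4 - 1) + \tfrac{1}{2}(s^2 - 1)\right]\phi_s &= 1, \\
\tfrac{15}{16}\left[\tfrac{1}{6}(s^6 - 1) - \tfrac{1}{2}(s^4 - 1) + \tfrac{1}{2}(s^2 - 1)\right]\theta_s + \tfrac{15}{16}\left[\tfrac{1}{7}(s^7 + 1) - \tfrac{2}{5}(s^5 + 1) + \tfrac{1}{3}(s^3 + 1)\right]\phi_s &= 0.
\end{aligned} \tag{2.4.6}$$

Fig. 6. Gasser and Müller's boundary kernels for s = 0, 0.25, 0.5, and 0.75. The
solid line is s = 0; the broken line, s = 0.25; the broken line with points
interspersed, s = 0.5; and the solid line with nodes, s = 0.75.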


Graphs of $K^{gm}_{s}(t)$ for s = 0, 0.25, 0.5, and 0.75 are displayed in Figure 6.
Again notice that the kernel must eventually have negative regions to satisfy
the constraints (2.4.5). Although for given $s$ the supports of $K^{gm}_{s}$ and $K^{j}_{s}$ are
different, their shapes are similar in that one can see a progressive and continuous
deformation of the original kernel. Despite needing to solve (2.4.6) for $\theta_s$ and $\phi_s$,
the Gasser-Müller kernel may be somewhat easier to work with since its support
depends on $s$ on only one side. Both kernels are rational functions of $s$.
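
The following Python sketch shows one way the two coefficients could be computed. It is an illustration under assumptions: the boundary kernel is taken to be a linear function of t multiplied by the biweight kernel on [-1, s], the incomplete moments of the biweight kernel are written in closed form, and the names `biweight_moments`, `gm_coefficients` and `gm_kernel` are hypothetical. The 2 x 2 system imposes unit mass and zero first moment.

```python
import numpy as np

def biweight_moments(s):
    """w_0, w_1, w_2: integrals of t^i K(t) over [-1, s] for the biweight kernel."""
    c = 15.0 / 16.0
    w0 = c * ((s + 1) - 2.0 / 3.0 * (s**3 + 1) + (s**5 + 1) / 5.0)
    w1 = c * ((s**2 - 1) / 2.0 - (s**4 - 1) / 2.0 + (s**6 - 1) / 6.0)
    w2 = c * ((s**3 + 1) / 3.0 - 2.0 / 5.0 * (s**5 + 1) + (s**7 + 1) / 7.0)
    return w0, w1, w2

def gm_coefficients(s):
    """Solve theta*w0 + phi*w1 = 1 and theta*w1 + phi*w2 = 0 for (theta, phi)."""
    w0, w1, w2 = biweight_moments(s)
    theta, phi = np.linalg.solve(np.array([[w0, w1], [w1, w2]]), np.array([1.0, 0.0]))
    return float(theta), float(phi)

def gm_kernel(t, s):
    """Boundary kernel (theta + phi*t) K(t) on [-1, s]; plain K(t) when s >= 1."""
    t = np.asarray(t, dtype=float)
    inside = (t >= -1.0) & (t <= min(s, 1.0))
    k = np.where(inside, 15.0 / 16.0 * (1.0 - t**2) ** 2, 0.0)
    if s >= 1.0:
        return k
    theta, phi = gm_coefficients(s)
    return (theta + phi * t) * k
```

At s = 1 the solution reduces to theta = 1 and phi = 0, so the boundary kernel joins the interior kernel continuously; at s = 0 the linear factor changes sign inside [-1, 0], which produces the negative region.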

The Gasser-Müller boundary kernel for the right hand endpoint, $x = 1$, is
easily obtained. One could derive the expression for the expected value of the
estimator for the right endpoint. One would then notice the same problem except
that the integrals are from $-s$ to 1 instead of $-1$ to $s$. In this case, $s$ is given
by $s = (1 - x)/h$. Due to the symmetry of the problem, it is easily shown that
$K^{r}_{s'}(u) = K^{l}_{s'}(-u)$, where $s' = (1 - x)/h$, $s = x/h$, $K^{l}$ is the left hand boundary
kernel and $K^{r}$ is the right hand boundary kernel.

2.4.4.  Orthogonal  Series  Methods.  Orthogonal  series  methods  are  reviewed 
in  this  subsection.  Many  orthogonal  series  lend  themselves  naturally  to  the  es¬ 
timation  of  densities  on  [0, 1].  Care  must  be  given  to  the  choice  of  orthogonal 
functions,  however,  since  this  choice  can  have  profound  effects  upon  the  proper¬ 
ties  of  the  estimator.  Although  orthogonal  series  ideas  are  taken  up  in  Section 
3,  the  traditional  methods  are  seen  in  this  subsection  not  to  be  exactly  what  is 
sought. 

The  expansion  of  non-random  functions  by  orthogonal  series  is  a  commonly 
used  technique  of  mathematical  analysis  [see  Stromberg  (1981)].  Cencov  (1962) 
was  the  first  to  suggest  that  such  techniques  could  be  useful  in  the  area  of  density 
estimation. 

Start by assuming that $f$ has support [0,1]. There do exist orthonormal
bases on $\mathbb{R}$, such as the Hermite functions; however, these are not of primary
interest here. Let $\{\phi_j(x)\}_{j=1}^{\infty}$ be a complete orthonormal basis for $L_2[0,1]$, which
is the space of all functions on [0,1] which are square integrable with respect to
the weight function, $w(x)$. A basis is orthonormal with respect to the weight
function $w(x)$ if

$$\int_0^1 \phi_j(x) \phi_k(x) w(x)\,dx = I(j = k),$$

where $I(j = k)$ is the (Kronecker's delta) indicator function. The basis is complete
if, for all $g \in L_2[0,1]$, there exists a sequence of constants $\{a_k\}$ such
that

$$\Big\| g - \sum_{k=1}^{n} a_k \phi_k \Big\| \to 0 \quad \text{as } n \to \infty.$$

The sequence $\{a_k\}$ is given by $a_k = \int_0^1 \phi_k(x) g(x) w(x)\,dx$, and $\|\cdot\|$ is the $L_2[0,1]$
norm with respect to the weight function, $\|g\|^2 = \int_0^1 g(x)^2 w(x)\,dx$.


Assume that the density $f$ admits the series representation,

$$f(x) \sim \sum_{k=1}^{\infty} a_k \phi_k(x),$$

where $a_k = \int_0^1 \phi_k(x) w(x) f(x)\,dx$. The orthogonal series estimate of the density is
defined to be

$$f^{o}(x) = \sum_{k=1}^{\infty} \lambda_k \hat{a}_k \phi_k(x), \tag{2.4.7}$$

where

$$\hat{a}_k = \int_0^1 \phi_k(x) w(x)\,dF_n(x) = \frac{1}{n}\sum_{i=1}^{n} w(X_i) \phi_k(X_i).$$

The sequence $\{\lambda_k\}$ is the smoothing parameter, which one expects to decline to
0 as $k$ becomes large. Several forms for the sequence have been suggested. For
the moment, it will be assumed that

$$\lambda_k = \begin{cases} 1, & \text{if } k \le m; \\ 0, & \text{otherwise.} \end{cases} \tag{2.4.8}$$

The estimator of equation (2.4.7) now becomes

$$f^{o}(x;m) = \sum_{k=1}^{m} \hat{a}_k \phi_k(x). \tag{2.4.9}$$

The parameter $m$ is referred to as the truncation point of the series and plays
the role of smoothing parameter.

Kronmal and Tarter (1968) consider estimators of form (2.4.9). Although
any orthonormal basis $\{\phi_k\}$ will do, they focus on trigonometric (or Fourier)
series. There are at least four distinct bases involving trigonometric functions;
these are

i. $\phi_k(x) = \cos \pi k x$, $k \ge 0$;

ii. $\phi_k(x) = \sin \pi k x$, $k \ge 1$;

iii. $\phi_k(x) = (\cos \pi k x, \sin \pi k x)$, $k \ge 1$;

iv. $\phi_k(x) = (\cos 2\pi k x, \sin 2\pi k x)$, $k \ge 0$.

These  will  be  referred  to  as  bases  (i),  (ii),  (iii),  and  (iv),  respectively. 
Kronmal and Tarter (1968) give results for (i), (ii), and (iii), whereas (iv)
is recognized as the basis function used by Parzen (1983) in the AR spectral
approach to estimating $d(u)$ discussed in Subsection 2.3.4 and by Parzen (1979)
and Carmichael (1984) in a general density estimation setting. Trigonometric
functions are often used as they are convenient, easy to calculate and their
properties are well known. Kronmal and Tarter also note that since trigonometric
functions differentiate and integrate to other trigonometric functions, one does
not have to choose between an orthogonal series expansion of $F$ and $f$. For the
trigonometric bases, the weight function is $w(x) = 2$. Other orthonormal bases
do exist for $L_2[0,1]$; the Legendre polynomials are an example.

Although bases (i) through (iv) are complete in $L_2[0,1]$, it is well known [see
Wahba (1975) or Hall (1981)] that the estimates obey certain conditions. Though
the details are different, it was seen in Subsection 2.3.4 that basis (iv) imposes
$\hat{d}(0) = \hat{d}(1)$ upon the estimates, as it does on $f^{o}$. Basis (ii) also imposes this
condition. In the case that $f(0) \ne f(1)$, these trigonometric series exhibit Gibbs'
phenomenon. The estimates will tend to be very wiggly near $x = 0, 1$, and in
fact $f^{o}$ will be estimating $(f(0) + f(1))/2$ at the points $x = 0, 1$. This poses no
difficulty to the $L_2$ convergence of the series since the $L_2$ norm is insensitive to
pointwise errors. Hall (1981) discusses Gibbs' phenomenon. He finds that the
rate of convergence of MISE can be as bad as $O(1/\sqrt{n})$. Newton (1988), page 77,
has an excellent general discussion of Gibbs' phenomenon. The problem arises in
part due to the choice of $\lambda_k$ as (2.4.8). In time series analysis, it is usual to give
the $\lambda_k$'s a damped form to reduce this problem. Basis (i) imposes the following
end conditions on the derivatives, $f^{o(2k-1)}(0;m) = f^{o(2k-1)}(1;m)$, for $k \ge 1$. Hall
(1983b) finds that this series is far more resistant to Gibbs' phenomenon than
any of the other three.

For the cosine series, basis (i), the density estimator is found by applying
equation (2.4.9), with the exception that the weight function $w(x) = 1$, not $w(x) =
2$, is appropriate for $k = 0$. Bearing this in mind, (2.4.9) becomes

$$f^{c}(x;m) = \frac{\hat{a}_0}{2} + \sum_{k=1}^{m} \hat{a}_k \cos \pi k x,$$

where

$$\hat{a}_k = \frac{2}{n}\sum_{i=1}^{n} \cos \pi k X_i \quad \text{for } k \ge 0.$$

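In the case of bases (iii) and (iv), the terms are taken in sine/cosine pairs so that
the estimator is

$$f^{cs}(x;m) = \frac{\hat{a}_0}{2} + \sum_{k=1}^{m} \hat{a}_k \phi^{c}_k(x) + \sum_{k=1}^{m} \hat{b}_k \phi^{s}_k(x),$$

where

$$\hat{a}_k = \frac{2}{n}\sum_{i=1}^{n} \phi^{c}_k(X_i) \ \text{for } k \ge 0, \qquad \hat{b}_k = \frac{2}{n}\sum_{i=1}^{n} \phi^{s}_k(X_i) \ \text{for } k \ge 1.$$

For basis (iii), $\phi^{c}_k(x) = \cos \pi k x$, $\phi^{s}_k(x) = \sin \pi k x$, and $\hat{a}_0 = 0$; for basis (iv),
$\phi^{c}_k(x) = \cos 2\pi k x$, $\phi^{s}_k(x) = \sin 2\pi k x$, and $\hat{a}_0 = 2$.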

One  cannot  be  assured  that  f°  constructed  from  any  of  these  bases  will  be  a 
valid  density.  In  particular,  f°  can  become  negative.  Kronmal  and  Tarter  (1968) 
make  two  points  concerning  this  issue.  First  they  note  that  in  all  their  simulations 
they  did  not  come  up  with  a  negative  estimate.  This  point  seems  somewhat  weak 
as  it  is  based  solely  on  a  handful  of  simulated  data  sets.  Their  second  point  is 
that  negative  estimates  are  not  a  complete  anathema.  Negative  estimates  should 
serve  as  a  warning  that  inference  in  the  negative  region  is  hazardous;  that  there  is 
insufficient  data  in  the  region.  This  second  point  seems  a  much  more  appealing 
response. The estimate, $f^{o}(x;m)$, at a point is a random variable taking on
values in $\mathbb{R}$; if $f$ is small at $x$, there is no reason to be surprised that $f^{o}(x;m)$
should be negative.

Notice that $f^{c}$ and $f^{cs}$ do integrate to 1. If the estimate is non-negative,
then the result is a density. If the estimate has negative regions, then the fact
that it integrates to 1 is of no interest. The key condition is non-negativity: given
a non-negative (and integrable) estimate, one can always normalize it to arrive
at a probability density.
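
Both properties, unit integral and possible negativity, are easy to see in a small example. The Python sketch below is illustrative only: it assumes the truncated cosine series on [0, 1] with coefficients equal to twice the sample mean of cos(pi k X), the function name `cosine_series_estimate` is hypothetical, and the artificial data set (ten observations all equal to 0.5) is chosen solely to force a negative region.

```python
import numpy as np

def cosine_series_estimate(x, data, m):
    """Truncated cosine series estimate on [0, 1] with truncation point m."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    data = np.asarray(data, dtype=float)
    k = np.arange(1, m + 1)
    # a_hat_k = (2/n) * sum_i cos(pi k X_i); the k = 0 term contributes a_hat_0 / 2 = 1.
    a_hat = 2.0 * np.cos(np.pi * np.outer(k, data)).mean(axis=1)
    return 1.0 + np.cos(np.pi * np.outer(x, k)) @ a_hat

grid = np.linspace(0.0, 1.0, 2001)
artificial = np.full(10, 0.5)                 # artificial data, assumption for illustration
estimate = cosine_series_estimate(grid, artificial, m=2)
```

With these data the estimate equals 1 - 2 cos(2 pi x) up to rounding, so it integrates to one over [0, 1] yet is negative near both endpoints, where there are no observations.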

As with histograms and kernel density estimators, the performance of
$f^{o}(x;m)$ is usually measured by MSE and MISE. Kronmal and Tarter give the
MISE as

$$\mathrm{MISE}(f^{o}, f) = \sum_{k=1}^{m} \mathrm{Var}(\hat{a}_k) + \sum_{k=m+1}^{\infty} a_k^2.$$

Kronmal and Tarter (1968) establish that if $m = o(\sqrt{n})$, then $\mathrm{MISE}(f^{o}, f) \to 0$
as $n \to \infty$. Using basis (iv), Hall (1981) derives the following relation for
MISE assuming that $f$ possesses $r$ derivatives and that $f^{(j)}(0) = f^{(j)}(1)$, for
$j = 0, \ldots, r - 1$,

$$\mathrm{MISE}(f^{(iv)}, f) \sim C_1 \frac{m}{n} + C_2 \left[f^{(r)}(0) - f^{(r)}(1)\right]^2 m^{-2r-1},$$

where $C_1$ and $C_2$ are positive constants that do not depend on $n$ or $m$.
The best rate of convergence of MISE is $O(n^{-(2r+1)/(2r+2)})$, which is $O(n^{-5/6})$ for
$r = 2$. This rate improves somewhat on that of the boundary kernel (Gasser-Müller
or Rice), which is $O(n^{-4/5})$ for $r = 2$. The improved rate, however, is obtained
only at the cost of requiring $f$ and its first $r - 1$ derivatives to be periodic.

Hall derives a similar result for basis (ii), by requiring $f$ to possess $2r$
derivatives satisfying $f^{(2j)}(0) = f^{(2j)}(1) = 0$, for $j = 0, \ldots, r - 1$. A rate of
$O(n^{-(4r+1)/(4r+2)})$ is then obtained. For basis (i), he requires $f$ to possess $2r + 1$
derivatives which satisfy $f^{(2j+1)}(0) = f^{(2j+1)}(1) = 0$, for $j = 1, \ldots, r$. In this
case, the best rate of convergence of MISE is $O(n^{-(4r+3)/(4r+4)})$. Notice that in
this case the restrictions apply only to the derivatives of $f$, not to the end values
of $f$ itself. This result must be related to the observation that series (i) is the
most resistant to Gibbs' phenomenon.

Return now to the general definition of the orthogonal series estimator given
by (2.4.7). Until now, the special form of $\lambda_k$ of equation (2.4.8) has been assumed.
One wonders if there might be a better choice of weights. Watson (1969) finds
the weights which minimize the MISE of $f^{o}$ to be

$$\lambda_k = \frac{n a_k^2 / E[\phi_k(X)^2]}{1 + (n - 1)\, a_k^2 / E[\phi_k(X)^2]}.$$

He notes that for fixed $k$, $\lambda_k \to 1$ as $n \to \infty$. He concludes that ordinary
truncation will probably be sufficient if the $a_k^2$'s are large compared to $\mathrm{Var}(\phi_k(X))/n$
for $k \le m$ and negligible for $k > m$.




Wahba (1981) defines an estimator similar in spirit to Watson's using basis
(iv). She assumes the same periodicity conditions on $f$ and its first $m - 1$
derivatives as Hall (1981), above. She defines the weights, $\lambda_k$, parametrically
and gives them a Bayesian interpretation. The final form of her estimator is

$$f_{n,\lambda,m}(x) = 1 + 2\sum_{k=1}^{n/2} \frac{\tilde{a}_k \cos 2\pi k x + \tilde{b}_k \sin 2\pi k x}{1 + \lambda (2\pi k)^{2m}},$$

where $\tilde{a}_k$ and $\tilde{b}_k$ denote the sample means of $\cos 2\pi k X_i$ and $\sin 2\pi k X_i$.
Wahba shows that if $\lambda = a n^{-2m/(2m+1)}$, for some constant $a$, then MISE =
$O(n^{-2m/(2m+1)})$.

As a final point on orthogonal series density estimation, it seems natural to
ask if it bears any relation to kernel density estimation. The answer is yes and
in the case of basis (iii) Kronmal and Tarter (1968) give the relation as

$$f^{(iii)}(x;m) = \frac{1}{n}\sum_{i=1}^{n} \delta_m(X_i - x),$$

where

$$\delta_m(x) = \frac{\sin[(2m + 1)\pi x/2]}{\sin[\pi x/2]}$$

is known as the Dirichlet kernel. There is no explicit bandwidth for the Dirichlet
kernel; instead $m$ plays the role of the smoothing parameter. The relation
between $m$ and $h$, the bandwidth of the usual kernel estimator, is approximately
$h \sim 1/m$. Figure 7 displays $\delta_m(x)$ for $m = 2, 4, 8$ and 16. Graphically, it is easy
to see the role of $m$. Note that this kernel is not unimodal and not non-negative.
Interestingly, the usual kernel approaches a delta function (unbounded at zero,
zero elsewhere) as $h \to 0$, whereas $\delta_m$ does not as $m \to \infty$. Although $\delta_m$ becomes
unbounded at zero, the side lobes never decay to zero. This raises an interesting
question: Should one choose an orthogonal series that's convenient and not worry
about the kernel representation or conversely? It is hard to imagine anyone
approaching this problem from the kernel perspective actually choosing to use the
Dirichlet kernel.
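
The kernel representation rests on a standard trigonometric identity: one plus twice the sum of cos(pi k u) for k = 1, ..., m equals the ratio of sines that defines the Dirichlet kernel. The Python sketch below checks this numerically; the function names `dirichlet_kernel` and `cosine_sum` are hypothetical, and the value 2m + 1 is substituted where the denominator vanishes, which is the limit of the ratio there.

```python
import numpy as np

def dirichlet_kernel(u, m):
    """sin((2m+1) pi u / 2) / sin(pi u / 2), with the limit 2m+1 where the denominator is zero."""
    u = np.asarray(u, dtype=float)
    den = np.sin(np.pi * u / 2.0)
    small = np.abs(den) < 1e-12
    num = np.sin((2 * m + 1) * np.pi * u / 2.0)
    return np.where(small, 2.0 * m + 1.0, num / np.where(small, 1.0, den))

def cosine_sum(u, m):
    """1 + 2 * sum_{k=1}^{m} cos(pi k u): the series form of the same kernel."""
    u = np.asarray(u, dtype=float)
    k = np.arange(1, m + 1)
    return 1.0 + 2.0 * np.cos(np.pi * np.multiply.outer(u, k)).sum(axis=-1)

u = np.linspace(-1.0, 1.0, 401)
for m in (2, 4, 8, 16):
    peak = float(dirichlet_kernel(0.0, m))
    trough = float(dirichlet_kernel(u, m).min())
    print(f"m = {m:2d}  peak = {peak:5.1f}  most negative value = {trough:7.3f}")
```

The printed troughs grow in magnitude with m, which is one way to see numerically that the side lobes do not die away as the peak sharpens.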


Fig. 7. The Dirichlet kernel. Figure (a) is constructed with m = 2; Figure (b),
m = 4; Figure (c), m = 8; and Figure (d), m = 16.


2.4.5. AR/ARMA Methods. The last categories of density estimates to
be examined are the autoregressive (AR) and autoregressive moving average
(ARMA) estimators. The AR and ARMA methods are natural for densities of
compact support. It is seen in this subsection that the AR approach imposes
certain restrictions on the estimated density. The ARMA approach is somewhat
less restrictive and contains as a special case the cosine based Fourier series
estimate.


The form of the AR estimator has been given in Subsection 2.3.4; however,
it is repeated here for the sake of completeness. The estimator, $f^{AR}(x;m)$, is
defined by

$$f^{AR}(x;m) = K_m \Big| 1 + \sum_{j=1}^{m} \alpha_j e^{2\pi i j x} \Big|^{-2},$$

where the $\alpha_j$'s are complex-valued and $|\cdot|^2$ denotes the complex squared
modulus. The $\alpha_j$'s and $K_m$ are estimated from the data. The discussion of Subsection
2.3.4 carries over exactly with the exception of the estimates of the pseudo- [...]
$m$ to grow with $n$ at an appropriate rate. The estimate, $f^{AR}(x;m)$, is itself a
density; it is non-negative and integrates to one. As noted in Subsections 2.3.4
and 2.4.4, the estimates have the property that $f^{AR}(0;m) = f^{AR}(1;m)$. One
should expect the estimates to exhibit bias if this condition is not met by the
underlying density.

Hart (1988) suggests what he calls an ARMA density estimate. It is defined
as

$$\begin{aligned}
f^{ARMA}(x;m,\alpha) &= \Big(b_0 + 2\sum_{j=1}^{m} b_j \cos \pi j x\Big) \big|1 - \alpha e^{i\pi x}\big|^{-2} \\
&= f^{c}(x;m) + 2\,\mathrm{Real}\big([\hat{a}_m \alpha e^{i\pi(m+1)x}]/[1 - \alpha e^{i\pi x}]\big),
\end{aligned}$$

where $\hat{a}_m$ is the estimate of the $m$th Fourier coefficient of $f$ and $f^{c}$ is defined in
Subsection 2.4.4. $f^{ARMA}$ is called an ARMA density estimate because the form
of $f^{ARMA}$ is similar to that of an ARMA(1,m) spectral density. Hart specifically
uses a cosine based series to minimize the Gibbs' phenomenon experienced
by the estimate. The pair, $(m, \alpha)$, constitute the smoothing parameter. Since
$f^{ARMA}(x;m,0) = f^{c}(x;m)$, $f^{ARMA}$ is a more general estimator than the cosine
Fourier series.

Hart derives exact and approximate MISE results. In particular, if the
Fourier coefficients, $\{a_j\}$, of $f$ and $\alpha = \alpha(m)$ obey either (a) $j^p a_j \to K \ne 0$ as
$j \to \infty$ and $m(1 - \alpha) \to c > 0$ as $m \to \infty$, or (b) $(-1)^j j^p a_j \to K \ne 0$ as $j \to \infty$
and $m(1 - \alpha) \to c > 0$ as $m \to \infty$, for $p > \tfrac{1}{2}$, the best rate of convergence of
MISE is $O(n^{1/(2p) - 1})$ for both $f^{ARMA}$ and $f^{c}$. In the case where $c = p$, Hart
shows that $\mathrm{MISE}(f^{ARMA}, f)/\mathrm{MISE}(f^{c}, f) < 1$; that is, even though the two
obtain the same rate, the constant in $O(n^{1/(2p) - 1})$ for $f^{ARMA}$ is smaller. This
result is reasonable given the class of cosine Fourier estimates is a subset of the
ARMA estimates.

2.4.6.  Choosing  the  Smoothing  Parameter.  Each  estimator  discussed  in 
Subsections  2.4.2  through  2.4.5  is  indexed  by  some  sort  of  smoothing  parameter. 
Let  s  denote  a  generic  smoothing  parameter.  In  this  subsection,  various  methods 
of  choosing  s  will  be  discussed. 


One of the first methods suggested is to choose $s$ to minimize MISE.
Unfortunately, in each case the optimal value depends in some way on the unknown
density, $f$. For the histogram, one needs $\int f'(x)^2\,dx$; for kernels, $\int f''(x)^2\,dx$
(assuming $r = 2$); for orthogonal series, the Fourier coefficients. Several suggestions
have been made to overcome this difficulty. One method estimates the unknown
quantity by the value it would have if $f$ falls in some parametric class; see Scott
(1979) and Silverman (1986). In the case of kernel estimates, one can estimate
the unknowns from the data nonparametrically; see Woodroofe (1970) and Scott,
Tapia and Thompson (1977). The parametric methods generally perform
adequately if $f$ resembles the assumed family (for example, if $f$ is unimodal). The
nonparametric methods fail to perform well in the simulation studies Bowman
(1985) conducts.

The second major class of selection methodologies could be termed selection
through optimization. In this technique, $s$ is chosen as the optimizing value
of some objective function. There are two general types of objective functions:
the likelihood function and estimates of MISE. Duin (1976) introduces the first,
which is termed likelihood cross-validation. The objective function is defined as

$$L(s) = \prod_{i=1}^{n} f_{(i)}(X_i; s),$$

where $f_{(i)}$ is the estimate of $f$ calculated with the $i$th observation omitted.
The parameter $s$ is chosen to maximize $L(s)$. The usual likelihood function,
$\prod f(X_i; s)$, is not employed since it typically leads to a degenerate choice of $s$
(i.e. $s = 0$ or $\infty$). In the case of kernel estimators, Chow, Geman and Wu
(1983) prove that if $f$ is bounded and of compact support and $h$ is chosen to
maximize $L(h)$, then $\mathrm{ISE}(f^{k}, f) \to 0$ a.s. as $n \to \infty$. Here, ISE is the integrated
squared error, $\mathrm{ISE}(f^{k}, f) = \int [f^{k}(x;h) - f(x)]^2\,dx$. In general circumstances, the
restrictions on $f$ are of concern, although not so here.
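
A small Python sketch may help to show why the leave-one-out form is needed. It is an illustration under assumptions not made in the text: a Gaussian kernel is used (so that the leave-one-out estimates stay positive for moderate bandwidths), the data are simulated standard normal, and the function names `loo_log_likelihood` and `full_log_likelihood` are hypothetical.

```python
import numpy as np

def _gaussian_kernel_matrix(data, h):
    """Matrix of standard normal kernel values K((X_i - X_j) / h)."""
    z = (data[:, None] - data[None, :]) / h
    return np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)

def loo_log_likelihood(data, h):
    """log L(h): sum over i of the log of the estimate at X_i with X_i omitted."""
    data = np.asarray(data, dtype=float)
    n = data.size
    k = _gaussian_kernel_matrix(data, h)
    np.fill_diagonal(k, 0.0)                     # omit the ith observation
    loo = k.sum(axis=1) / ((n - 1) * h)
    with np.errstate(divide="ignore"):           # log(0) = -inf flags a useless bandwidth
        return float(np.sum(np.log(loo)))

def full_log_likelihood(data, h):
    """Ordinary log likelihood; the self term K(0) / (n h) grows without bound as h -> 0."""
    data = np.asarray(data, dtype=float)
    k = _gaussian_kernel_matrix(data, h)
    return float(np.sum(np.log(k.sum(axis=1) / (data.size * h))))

rng = np.random.default_rng(1)
sample = rng.normal(size=200)                    # simulated data, assumption for illustration
bandwidths = np.geomspace(0.05, 3.0, 30)
scores = [loo_log_likelihood(sample, h) for h in bandwidths]
h_lcv = float(bandwidths[int(np.argmax(scores))])
```

The ordinary likelihood keeps increasing as the bandwidth shrinks, whereas the leave-one-out version penalizes very small bandwidths heavily, so its maximizer is a usable bandwidth.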

In the second method, an estimate of MISE is minimized with respect to $s$.
For trigonometric series, it is possible to estimate MISE, or its increments,
directly; see Kronmal and Tarter (1968), Tarter and Kronmal (1976), Hart (1985),
Diggle and Hall (1986), and Wahba (1981). Rudemo (1982) and Bowman (1984)
introduce least squares cross-validation (LSCV), which has application to a wide
range of estimators. The objective function is given by

$$\mathrm{LSCV}(s) = \frac{1}{n}\sum_{i=1}^{n} \int f_{(i)}(x;s)^2\,dx - \frac{2}{n}\sum_{i=1}^{n} f_{(i)}(X_i;s).$$

Rudemo (1982) shows that LSCV is an unbiased estimator of $\mathrm{MISE}(f(\cdot\,;s), f) -
\int f(x)^2\,dx$. Since the last term does not depend on $s$, it is hoped that minimizing
LSCV with respect to $s$ will be like minimizing MISE. Hall (1983a) and Stone
(1984) give results for kernel estimates concerning the behavior of $f^{k}$ when $h$ is
so chosen. In particular, assuming only that $f$ is bounded, Stone shows that

$$\frac{\int [f^{k}(x;\hat{h}) - f(x)]^2\,dx}{\int [f^{k}(x;h^{*}) - f(x)]^2\,dx} \to 1 \quad \text{a.s. as } n \to \infty,$$

where $\hat{h}$ is the minimizer of LSCV and $h^{*}$ minimizes ISE.

Of  all  the  methods  discussed,  LSCV  is  probably  the  most  widely  used,  al¬ 
though  one  should  not  regard  LSCV  as  a  panacea.  Silverman  (1986),  page  51, 
points  out  that  for  kernel  estimation  LSCV  can  lead  to  a  degenerate  choice  of  h  if 
the  observations  are  discretized.  It  is  also  well  recognized  [see  Hart  (1988,1985) 
and  Scott  and  Terrell  (1987)]  that  LSCV  tends  to  substantially  undersmooth 
about 5% to 20% of the time. Nonetheless, least squares cross-validation is a
useful  and  general  tool. 
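
The following Python sketch evaluates a least squares cross-validation score in closed form. It is a sketch under assumptions: a Gaussian kernel replaces the compactly supported kernels used elsewhere in this section, both terms are built from leave-one-out estimates, and the names `gauss` and `lscv` are hypothetical. The closed form relies on the fact that the integral of the product of two normal densities with standard deviation h, centered at a and b, is a normal density with standard deviation sqrt(2) h evaluated at a - b.

```python
import numpy as np

def gauss(u, sd):
    """Normal density with mean 0 and standard deviation sd, evaluated at u."""
    return np.exp(-0.5 * (np.asarray(u, dtype=float) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def lscv(data, h):
    """(1/n) sum_i int f_(i)(x)^2 dx - (2/n) sum_i f_(i)(X_i) for a Gaussian kernel."""
    data = np.asarray(data, dtype=float)
    n = data.size
    d = data[:, None] - data[None, :]
    conv = gauss(d, np.sqrt(2.0) * h)            # int K_h(x - X_j) K_h(x - X_k) dx
    conv0 = float(gauss(0.0, np.sqrt(2.0) * h))
    # Summing conv over all pairs (j, k) with j, k != i, and then over i,
    # gives (n - 2) * conv.sum() + n * conv0.
    first = ((n - 2) * conv.sum() + n * conv0) / (n * (n - 1) ** 2)
    pair = gauss(d, h)
    second = 2.0 * (pair.sum() - n * float(gauss(0.0, h))) / (n * (n - 1))
    return float(first - second)

rng = np.random.default_rng(2)
sample = rng.normal(size=200)                    # simulated data, assumption for illustration
bandwidths = np.geomspace(0.05, 3.0, 30)
h_lscv = float(bandwidths[int(np.argmin([lscv(sample, h) for h in bandwidths]))])
```

Because the score is noisy for small bandwidths, the minimizer over a grid can occasionally land at a very small value, which is one way the undersmoothing mentioned above shows up in practice.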


2.4.7.  Choice  of  Estimator.  The  estimator  to  be  used  in  this  dissertation  is 
the  boundary  kernel  of  Gasser  and  Muller.  The  rationale  for  choosing  a  kernel 
based  estimator  and  specifically  the  Gasser-Muller  boundary  kernel  is  detailed 
below. 

The  histogram  is  not  used  because  it  is  felt  that  it  does  not  convey  informa¬ 
tion  well.  It  is  inherently  rough  and  discrete,  yet  it  is  estimating  an  object  which 
is  continuous  and  smooth.  A  smooth  estimator  is  desired.  Further,  the  rate  of 
convergence  of  MISE  falls  well  below  that  of  other  techniques  examined. 

A trigonometric series is not employed because one is led to kernel
representations for the estimate that use poor kernels. To answer the question posed
at the end of Subsection 2.4.4, it is better to choose what seems an appropriate
kernel; one that leads to sensible estimates and has sensible properties. This is
not  to  say  that  orthogonal  series  have  been  abandoned  altogether.  In  Section  3,  a 
series  representation  for  the  Gasser-Muller  boundary  kernel  is  obtained.  Rather, 
it  is  to  say  that  it  is  better  to  let  the  kernel  determine  the  orthogonal  series.  Ker¬ 
nels  seem  to  be  more  easily  examined  as  to  their  implications  for  the  estimate. 
Further,  the  kernel  based  orthogonal  series  is  indexed  by  the  bandwidth,  h.  In 
Section  3,  where  the  orthogonal  series  will  be  used  to  construct  score  functions 
for  linear  rank  statistics,  this  will  be  very  convenient.  It  will  be  seen  that  the 
tail  behavior  of  the  score  functions  varies  with  h  so  that  distinct  values  of  h  cor¬ 
respond  to  optimal  scores  for  different  distributions.  If  an  ordinary  orthogonal 
series  were  used,  it  would  be  necessary  to  change  the  series  to  achieve  such  an 
effect.  Since  the  ARMA  method  is  very  nearly  an  orthogonal  series  method  much 
the  same  reasoning  applies. 

The cut and normalize and reflection boundary kernels are not used because
of their greater bias. It is felt that it is worth trading the guarantee of
non-negative estimates for bias reduction. The discussion of the interpretation of
negative estimates in Subsection 2.4.4 removes some of the onus of the situation.
Further, one needs to examine the potential uses of an estimate of $d(u)$. Even
though $d(u)$ is a density, it won't be used for simulation, nor will probabilities be
calculated. Recalling $d(u)$'s interpretation as a likelihood ratio, the important
feature of $d(u)$ is its shape: which regions are large relative to others and relative
to 1. Regions which are negative are to be interpreted as having little or no
content.  In  these  circumstances,  there  is  far  less  need  to  require  the  estimate 
to  itself  be  a  density  function.  Rather,  it  is  preferable  to  have  an  improved 
estimate  in  terms  of  MISE.  Finally,  there  is  not  a  lot  to  choose  between  Gasser 
and  Muller’s  boundary  kernel  and  that  of  Rice.  Both  have  the  same  asymptotic 
representation  and  broadly  similar  shapes.  It  is  somewhat  more  compact  to  write 
down  the  Gasser-Muller  boundary  kernel  and  so  it  is  selected. 
3.  ESTIMATION  AND  TESTING

3.1.  Introduction 

3.1.1.  Introduction.  Results  for  the  boundary  kernel  estimation  of  the  com¬ 
parison  density  and  tests  of  its  uniformity  are  presented  in  this  section.  Subsec¬ 
tion  3.2  describes  results  concerning  the  estimation  phase  of  the  process.  Subsec¬ 
tion  3.2.2  gives  more  traditional  results  for  the  estimator,  including  its  asymptotic 
normality  under  H0  and  its  consistency  under  general  alternatives.  Both  these 
results  are  derived  under  a  shrinking  bandwidth.  The  invariance  properties  of 
the  estimator  are  also  detailed. 

Subsection  3.2.3  defines  a  stochastic  process  based  on  the  boundary  kernel 
estimate.  This  stochastic  process  is  called  the  kernel  density  process.  Under  a 
fixed  bandwidth,  the  weak  convergence  of  this  process  to  a  limiting  Gaussian 
process  is  proved.  A  convenient  representation  for  the  limiting  process  is  given. 
Properties  of  these  processes  are  explored.  They  are  seen  to  be  continuous  with 
probability  1.  The  null  covariance  kernel,  which  is  the  covariance  kernel  of  the 
limiting  process  under  H0,  is  given.  The  quality  of  the  approximation  of  the 
limiting  distribution  under  fixed  h  is  compared  to  that  of  shrinking  h  under 
H0.  A  simulation  study  is  conducted  to  carry  out  this  comparison.  The  results 
indicate  that  the  fixed  h  approximation  is  superior. 

Subsection 3.3 gives results concerning the testing phase of the procedure.
During the study another estimator of the comparison density suggests itself.
Subsection 3.3.2 details the statistic $\varphi^2_{N,h}$, which is the square of
the $\mathcal{L}_2$ norm of the difference between the boundary kernel estimator
and 1. Although it is a statistic and so is not what is sought, an analysis of
its distribution leads to the idea of the components of the kernel density
process. These components are similar in spirit to those of the Cramér-von Mises
and Anderson-Darling statistics described in Subsection 2.2.5. Subsection 3.3.3
investigates the components in depth. Also investigated are those concepts
required to define the components, such as the null covariance kernel and its
eigenvalues and eigenfunctions. A numerical procedure to estimate the
eigenfunctions and eigenvalues is given and a check of its accuracy performed.
The properties of the components are worked out. They are seen to be generalized
Fourier coefficients and linear rank statistics. The joint convergence in
distribution of the sample components to their limiting counterparts is proved.
A small sample correction to the means of the components is given. It is seen
that the space spanned by the eigenfunctions is of interest. Finally, it is
argued that a test based on the first $M$ components which gives equal weight to
each will yield more fruitful results than the traditional statistics, which are
weighted infinite sums of the components.

Subsection 3.3.4 investigates whether the space spanned by the eigenfunctions
contains the space in which the kernel density process resides. This condition
is seen to be related to the positive definiteness of the null covariance kernel
over this space. An equivalent condition which is a Fredholm integral equation
of the first kind is given. Unfortunately, it is not possible to check either of
these conditions: the equations are too complex. The implications of the
eigenfunctions spanning or not spanning the appropriate space are detailed.

Subsection 3.3.5 introduces a new framework for testing the components. This
framework suggests what is called the subset chi-square test. The traditional
tests, namely the chi-square test and the independent tests method (i.e. testing
each component independently), fit into this framework as well. The three tests
are compared. The subset chi-square is shown to be a compromise between the
other two. It is also seen to possess other desirable properties. It considers
the components as groups, not just singly. In the case of rejection, it
indicates which components are significant. Finally, it lends itself well to
graphical display.

Subsection 3.3.6 applies the subset chi-square test to the components. The test
suggests an orthogonal series estimate of the comparison density. This
orthogonal series estimate is investigated and contrasted to the boundary kernel
estimator. The orthogonal series estimate is proved to be a weighted orthonormal
series where the weights are the eigenvalues of the null covariance kernel.
Subsection 3.3.7 suggests alternate strategies of choosing the bandwidth and
truncation point. Also discussed are the pros and cons of automatic selection
criteria and their effect on the testing procedure. Subsection 3.3.8 summarizes
the unified procedure.

3.1.2. Assumptions. The various assumptions made on the underlying
distributions have been scattered throughout the first two sections. They are
repeated here for clarity.

1. $X_1, \ldots, X_m$ are iid with distribution function $F$.

2. $Y_1, \ldots, Y_n$ are iid with distribution function $G$.

3. The two samples are independent.

4. $F$ and $G$ are absolutely continuous with densities $f$ and $g$, respectively.

5. The quantile functions of $F$ and $G$, $Q_F$ and $Q_G$, are continuous.

6. If $f$ has support $[a_f, b_f]$ and $g$ has support $[a_g, b_g]$, then $f$ and
$g$ are continuous on $(a_f \wedge a_g,\, b_f \vee b_g)$.

7. Let $\lambda_{(N)} = m/N$, $N = m + n$. Then it is usually assumed that
$\lambda_{(N)} \to \lambda_0$ as $m \wedge n \to \infty$, where
$0 < \lambda_0 < 1$, but sometimes $\lambda_{(N)} = \lambda_0$ is assumed. It
will be pointed out where this latter assumption is used.

3.2.  Properties  of  the  Boundary  Kernel  Estimator 

3.2.1. Introduction. Subsection 3.2 examines the properties of the boundary
kernel estimator of the comparison density function. In Subsection 3.2.2
asymptotic pointwise results are established. The asymptotic normality of the
estimator under $H_0$ is established as is its consistency. These results are
traditional in the sense that they occur as the bandwidth shrinks to zero at an
appropriate rate. The invariance properties of the estimate are also examined.
In Subsection 3.2.3, results for the estimator are derived by treating it as a
stochastic process on $[0,1]$. These results are derived under the assumption of
fixed bandwidth; that is, the bandwidth does not shrink to zero as the sample
sizes increase. A process based on the kernel density estimator, called the
kernel density process, is defined and its weak convergence is shown under this
condition. The results of both these subsections are unique because the
underlying stochastic process, $CD_N$, and its limiting process, $L$, are not
those associated with iid random variables for which density estimation results
are usually derived. No such results are known to exist in this framework.


3.2.2. Pointwise Asymptotic Results. In this subsection, two main pointwise
results are established for the boundary kernel estimate of the comparison
density. The first is its asymptotic normality under $H_0$ and the second is its
consistency under any alternative. The invariance properties of the estimate are
also investigated. Start by defining the boundary kernel estimator of the
comparison density as

$$\hat d_h(w) = \frac{1}{h}\int_0^1 K_s\!\left(\frac{w-u}{h}\right) dD_N(u)
 = \frac{1}{mh}\sum_{i=1}^{m} K_s\!\left(\frac{w - R_i/N}{h}\right),$$

where $K_s(w)$ is the Gasser-Müller boundary kernel and $s = s(w,h)$ is given by

$$s(w,h) = \begin{cases}
 w/h & \text{for } 0 \le w < h \\
 (1-w)/h & \text{for } 1-h < w \le 1 \\
 1 & \text{otherwise.}
\end{cases}$$

The function $s(w,h)$ chooses which of the family of boundary kernels is
appropriate. It is understood that when $w < h$ the left hand boundary kernel is
selected and when $w > 1 - h$ the right hand boundary kernel is selected. The
remaining terms have been defined in Subsections 2.2 and 2.3: $D_N(u)$ is the
sample comparison distribution function; $N = m + n$ and $R_i$ is the rank of
$X_i$ in the pooled sample.
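As an illustration only, the rank form of the estimator can be sketched in
Python for interior points, where $s(w,h) = 1$ and no boundary modification is
needed. The sketch assumes the biweight kernel (the kernel adopted later in
this section) and no ties in the pooled sample. The function names
`biweight`, `s_index` and `d_hat_interior` are hypothetical, and the
Gasser-Müller boundary kernels of Subsection 2.4.3 are deliberately not
implemented, so points within $h$ of 0 or 1 are rejected.

```python
import numpy as np


def biweight(t):
    """Biweight (quartic) kernel with support [-1, 1]."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= 1.0, 15.0 / 16.0 * (1.0 - t**2) ** 2, 0.0)


def s_index(w, h):
    """Selector s(w, h): which member of the boundary kernel family applies."""
    if w < h:
        return w / h
    if w > 1.0 - h:
        return (1.0 - w) / h
    return 1.0


def d_hat_interior(x, y, w, h, kernel=biweight):
    """Rank-based kernel estimate of the comparison density at an interior w.

    Assumes no ties in the pooled sample; boundary kernels are not implemented.
    """
    if s_index(w, h) != 1.0:
        raise ValueError("w is within h of a boundary; a boundary kernel is needed")
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    pooled = np.concatenate([x, y])
    n_total = pooled.size
    # Ranks (1..N) of the X observations in the pooled sample.
    ranks = np.argsort(np.argsort(pooled))[: x.size] + 1
    return kernel((w - ranks / n_total) / h).sum() / (x.size * h)
```

For example, `d_hat_interior([0.1, 0.4, 0.7], [0.2, 0.5, 0.9], w=0.5, h=0.5)`
uses the ranks 1, 3 and 5 of the first sample among the $N = 6$ pooled values.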

The following theorem concerning the asymptotic normality of $\hat d_h(w)$ under
$H_0$ can be proved.


Theorem 3.2.1. If $K(t)$ is a kernel with support $[-1,1]$ satisfying
$K(-1) = K(1) = 0$, $K'(-1) = K'(1) = 0$, $K''(-1) = K''(1) = 0$, and
$|K''(t)| \le M$ for some $M < \infty$ and $\int_{-1}^{1} K(t)\,dt = 1$, and if
$h \to 0$, $(m \wedge n)h^3 \to \infty$ as $m \wedge n \to \infty$, then under
$H_0$, $\hat d_h(w)$ is
$AN\!\left(1,\ \frac{1-\lambda_0}{\lambda_0 N h}\int_{-1}^{1} K(t)^2\,dt\right)$
for each $w \in (0,1)$.


The proof of Theorem 3.2.1, along with the proofs of all the theorems and lemmas
stated in Sections 3 and 4, is in Appendix B. The proof of Theorem 3.2.1 is long
and somewhat tedious. Its basic approach is motivated by Chernoff and Savage
(1958). The strategy is to write $\sqrt{Nh}\,(\hat d_h(w) - 1)$ as the sum of
four terms, two first order terms and two second order terms. The first order
terms are shown to converge in distribution to the appropriate random variable
while the second order terms are shown to converge in probability to zero.

It should be noted that since the bandwidth, $h$, is shrinking to zero,
boundary effects do not occur asymptotically for each $w \in (0,1)$. This is so
since for each fixed $w$, boundary effects are experienced only if $w < h$ or
$w > 1 - h$. Since $h \to 0$, these effects will cease at some point. In the
proof of Theorem 3.2.1, one can ignore the Gasser-Müller modification of $K$ and
concentrate solely on $K$ itself.

Comparing Theorem 3.2.1 with the types of conditions one normally sees for
kernel estimators in the iid case, one notes (a) additional conditions on $K$
and (b) $(m \wedge n)h^3 \to \infty$ instead of $Mh \to \infty$ (for a random
sample of size $M$ in the iid case). This comparison is of interest since under
$H_0$ one might reasonably expect the normalized ranks, $R_1/N, \ldots, R_m/N$,
to behave like a random sample from a uniform, $U(0,1)$, distribution. Indeed,
when viewed from a process point of view, they have the same limiting empirical
process up to a multiplicative constant. During the proof of Theorem 3.2.1, one
sees that the first order terms do, in fact, behave like a kernel smoothing of
iid $U(0,1)$ random variables. For these terms, the condition
$(m \wedge n)h \to \infty$ is sufficient for asymptotic normality. The extra
conditions, both on $K$ and the rate at which $h$ goes to 0, are necessary to
show that the second order terms converge to zero in probability.

Although interesting in its own right, it is not clear how Theorem 3.2.1 might
be used to test the null hypothesis. One certainly wouldn't base a test on the
kernel estimate at a single point. It should be possible to extend Theorem 3.2.1
to show the joint convergence in distribution of
$\hat d_h(w_1), \ldots, \hat d_h(w_k)$ to a multivariate normal distribution for
fixed $k$ and $w_1, \ldots, w_k$. However, at this point one encounters a
practical problem. To test $H_0$, how would one choose $k$ and
$w_1, \ldots, w_k$? There perhaps is some way to choose these points in an
optimal fashion; however, such a scheme would certainly depend on the true and
unknown values of the comparison density. Such statistics would also not fit the
criteria outlined in Section 1 and Subsection 2.2.6.

Pursuing for the moment a test based on $k$ values, a natural statistic to use
would be of the form

$$\sum_{i=1}^{k} \left[\hat d_h(w_i) - 1\right]^2.$$

Letting $k$ grow large, a natural analogue to this statistic is

$$\int_0^1 \left[\hat d_h(w) - 1\right]^2 dw. \tag{3.2.1}$$

It is only appropriate that, barring some reason to look at a specific and fixed
set of $w$'s, all of them should be considered. Statistics such as (3.2.1)
cannot be handled by pointwise convergence in distribution results such as
Theorem 3.2.1. Instead, weak convergence results are required. These are treated
in Subsection 3.2.3. Although (3.2.1) is also a statistic and so does not fit
the criteria which have been outlined, a study of (3.2.1) leads to a methodology
which does. Construction of tests is taken up in Subsection 3.3.

Another type of asymptotic result which is often of interest is consistency.
The following theorem can be proved regarding the consistency of $\hat d_h(w)$.

Theorem 3.2.2. Let $h \to 0$, $(m \wedge n)h^2 \to \infty$ as
$m \wedge n \to \infty$ and let $K$ have support $[-1,1]$ with $K$
differentiable on $(-1,1)$. If $\lambda_{(N)} = \lambda_0$ is not a function of
$N$ or $d_{(N)}$ converges to $d_0$ uniformly, then
$\hat d_h(w) \xrightarrow{p} d_0(w)$ as $m \wedge n \to \infty$; otherwise
$\hat d_h(w) - (1/h)\int_0^1 K[(w-u)/h]\,d_{(N)}(u)\,du \xrightarrow{p} 0$ as
$m \wedge n \to \infty$.

Consistency  is  generally  regarded  as  a  good  property.  Theorem  3.2.2  states  that 
if  h  tends  to  zero  at  the  appropriate  rate  then  d^(w)  will  indeed  be  consistent. 

Again, it is interesting to compare Theorem 3.2.2 with results from ordinary
kernel density estimation. In the iid case, pointwise consistency is achieved
under the conditions $h \to 0$, $Mh \to \infty$ and uniform consistency if
$h \to 0$, $Mh^2 \to \infty$. The extra conditions on $\lambda_{(N)}$ and
uniform convergence result from the fact that one is approximating a function
which itself is changing with $N$. Hence, one needs either that it isn't
changing with $N$ (i.e. $\lambda_{(N)} = \lambda_0$) or uniform convergence. In
a very real sense, the proofs of theorems in the iid case are much easier
because a good deal more is known than simply the weak convergence of the
empirical process. It is not surprising, then, that results obtained in the iid
case should be stronger.

There is one last property of the kernel estimate which should be examined.
This is the question of invariance. In this context, invariance refers to
whether it makes a difference which of the two samples is called the first
sample. Expanding the notation for this purpose, let
$D^F_\lambda(w) = F(Q_H(w))$ be the comparison distribution function when the
population with distribution function $F$ is called the first sample. Similarly,
let $D^G_\lambda(w) = G(Q_H(w))$ be the comparison distribution function when
the population with distribution function $G$ is called the first sample. Let
$d^F_\lambda(w)$ and $d^G_\lambda(w)$ be the corresponding comparison density
functions. Let $\lambda$ be the weight given to distribution function $F$ (i.e.
the probability of choosing population $F$ or the fraction of the total sample
represented by population $F$), and let $Q_H$ denote the quantile function of
the pooled distribution $H = \lambda F + (1-\lambda)G$. Parzen (1983) shows that
$d^F_\lambda$ and $d^G_\lambda$ satisfy

$$\lambda\, d^F_\lambda(w) + (1-\lambda)\, d^G_\lambda(w) = 1$$

for all $w \in [0,1]$. The boundary kernel estimates of these quantities satisfy

$$\lambda_{(N)}\hat d^F_h(w) + (1-\lambda_{(N)})\hat d^G_h(w)
 = \frac{1}{Nh}\left[\sum_{i=1}^{m} K_s\!\left(\frac{w-R_i/N}{h}\right)
 + \sum_{i=1}^{n} K_s\!\left(\frac{w-S_i/N}{h}\right)\right]
 = \frac{1}{Nh}\sum_{j=1}^{N} K_s\!\left(\frac{w-j/N}{h}\right)
 \to 1 \quad \text{as } N \to \infty \text{ and } Nh \to \infty$$

for all $w \in [0,1]$, where $s = s(w,h)$ and $S_i$ is the rank of $Y_i$ in the
pooled sample. The last sum is a rectangular sum approximation to the integral
$\frac{1}{h}\int_0^1 K_s[(w-u)/h]\,du$, which is 1. The above convergence is
shown as part of the proof of Theorem 3.2.1 in Appendix B. Asymptotically, the
boundary kernel estimate obeys the invariance property of the population
quantities.
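The invariance relation can be checked numerically. The following Python
sketch is an illustration under stated assumptions, not the computation used
in this work: it takes the biweight kernel, restricts attention to an interior
point (so the boundary kernels are not needed), assumes no ties, and uses
hypothetical names (`kernel_estimate`, `invariance_check`). It verifies that
the weighted sum of the two estimates equals the rectangular sum over all $N$
pooled ranks, and that this sum is close to 1.

```python
import numpy as np


def biweight(t):
    """Biweight (quartic) kernel with support [-1, 1]."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= 1.0, 15.0 / 16.0 * (1.0 - t**2) ** 2, 0.0)


def kernel_estimate(ranks, n_total, w, h):
    """Interior-point kernel estimate built from the ranks of one sample."""
    ranks = np.asarray(ranks, dtype=float)
    return biweight((w - ranks / n_total) / h).sum() / (ranks.size * h)


def invariance_check(x, y, w, h):
    """Return (weighted sum of the two estimates, rectangular sum over 1..N)."""
    pooled = np.concatenate([x, y])
    n_total = pooled.size
    ranks = np.argsort(np.argsort(pooled)) + 1      # pooled ranks, no ties assumed
    r_x, r_y = ranks[: len(x)], ranks[len(x):]
    lam = len(x) / n_total
    weighted = (lam * kernel_estimate(r_x, n_total, w, h)
                + (1.0 - lam) * kernel_estimate(r_y, n_total, w, h))
    grid = np.arange(1, n_total + 1) / n_total
    rect = biweight((w - grid) / h).sum() / (n_total * h)
    return weighted, rect


rng = np.random.default_rng(0)
weighted, rect = invariance_check(rng.normal(size=40),
                                  rng.normal(loc=1.0, size=60), w=0.5, h=0.2)
```

Note that the two samples here are drawn from different distributions: the
identity concerns the pooled ranks only and does not require $H_0$.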


3.2.3. The Kernel Density Process. In this subsection, the kernel density
process is defined and a theorem concerning its weak convergence under a fixed
bandwidth is stated. The implications of a fixed bandwidth are discussed and
the covariance kernel of the limiting process under $H_0$ is found and
investigated. The kernel density process, $\mathrm{KDP}_{N,h}(w)$, is defined as

$$\mathrm{KDP}_{N,h}(w) = \frac{1}{h}\int_0^1 K_s\!\left(\frac{w-u}{h}\right)
 dCD_N(u), \tag{3.2.2}$$

where $K_s$ is a boundary kernel and $CD_N$ is as previously defined. The
process $\mathrm{KDP}_{N,h}(w)$ is simply a centered and scaled version of
$\hat d_h(w)$, that is,

$$\mathrm{KDP}_{N,h}(w) = \sqrt{N}\left[\hat d_h(w) - d_h(w)\right],$$

where $d_h(w) = \frac{1}{h}\int_0^1 K_s[(w-u)/h]\,d_{(N)}(u)\,du$ is a smoothing
of the comparison density.

To allow for some flexibility, the theorem concerning the weak convergence of
$\mathrm{KDP}_{N,h}$ is stated for any boundary kernel which obeys the following
regularity conditions.


Regularity Conditions. Let $K_s(w)$ be a family of boundary kernels indexed by
$s \in [0,1]$. It is required that the derivative of $K_s(w)$ with respect to
$w$ exist for each $s$ and that $K'_s(w)$ be continuous on $\mathbb{R}$. Define


0(6)  =  sup 
\x-y\<6 


Tr)h 


where $s = s(x,h)$ and $s' = s(y,h)$. It is required that $\theta(\delta) \to 0$ as $\delta \to 0$.


Lemma 3.2.1 states the conditions under which the Gasser-Müller boundary kernel
satisfies the Regularity Conditions.

Lemma 3.2.1. Let $K(t)$ be a differentiable kernel with support $[-1,1]$
satisfying (1) $K$ is continuous on $\mathbb{R}$ and (2) $K'$ is continuous on
$\mathbb{R}$. Then the Gasser-Müller boundary kernel based on $K$ satisfies the
Regularity Conditions.


From Lemma 3.2.1 it follows that the Gasser-Müller boundary kernel cannot be
based on just any kernel. For instance, the popular Epanechnikov kernel, which
is a quadratic function, cannot be used. However, the biweight kernel, which is
a quartic kernel, can be used. The biweight will be used throughout this work
when a specific kernel is required. That the Epanechnikov kernel cannot be used
is not of great concern. From a practical standpoint, there is not much to
choose from within the class of kernels having support $[-1,1]$ which are
probability density functions, symmetric about zero and continuous on
$\mathbb{R}$. Very similar results can be obtained from these kernels by
altering the bandwidths. The differences are in higher order smoothness
properties which are difficult to detect visually. However, the smoother kernel
is required here for the proofs to go through.
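The distinction between the two kernels under condition (2) of Lemma 3.2.1 can
be seen directly from their derivatives. The short Python sketch below is
illustrative only; the function names are hypothetical, and the standard forms
$K(t) = \frac{3}{4}(1-t^2)$ (Epanechnikov) and $K(t) = \frac{15}{16}(1-t^2)^2$
(biweight) on $[-1,1]$ are assumed. It measures the jump of $K'$ across the
edge of the support at $t = 1$.

```python
def epanechnikov_deriv(t):
    """Derivative of K(t) = 0.75 (1 - t^2) on [-1, 1], zero outside."""
    return -1.5 * t if abs(t) <= 1.0 else 0.0


def biweight_deriv(t):
    """Derivative of K(t) = (15/16) (1 - t^2)^2 on [-1, 1], zero outside."""
    return -3.75 * t * (1.0 - t * t) if abs(t) <= 1.0 else 0.0


def jump_at_one(deriv, eps=1e-9):
    """Size of the jump of a derivative across the support edge t = 1."""
    return abs(deriv(1.0 - eps) - deriv(1.0 + eps))


epan_jump = jump_at_one(epanechnikov_deriv)   # close to 1.5: K' is discontinuous
biwt_jump = jump_at_one(biweight_deriv)       # close to 0: K' is continuous
```

The Epanechnikov derivative drops from about $-1.5$ to 0 at the edge of the
support, so $K'$ is not continuous on $\mathbb{R}$, whereas the biweight
derivative vanishes there.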

The limiting process of $\mathrm{KDP}_{N,h}$ is $\mathrm{KDP}_h$, which is
defined as

$$\mathrm{KDP}_h(w) = \frac{1}{h^2}\int_0^1 K'_s\!\left(\frac{w-u}{h}\right)
 L(u)\,du,$$

where $L$ is the limiting process of $CD_N$ as defined in Subsection 2.3.3.
Before proving that $\mathrm{KDP}_h$ is, in fact, the limiting process of
$\mathrm{KDP}_{N,h}$, its existence must be shown. One must show that the
defining integral equation has some meaning. This result is given by Lemma
3.2.2.

Lemma 3.2.2. The sample paths of the process $\mathrm{KDP}_h$ exist and are
continuous with probability 1.

Now that the needed Regularity Conditions have been established and the
limiting process exists with probability 1, the stage is set for the main result
of this subsection. Theorem 3.2.3 gives the weak convergence of
$\mathrm{KDP}_{N,h}$ to $\mathrm{KDP}_h$.

Theorem 3.2.3. If the boundary kernel satisfies the Regularity Conditions then
for fixed $h$ one has

$$\mathrm{KDP}_{N,h} \Rightarrow \mathrm{KDP}_h$$

in $(C[0,1], \mathcal{C}_\rho, \rho)$ as $n \wedge m \to \infty$.

The triple $(C[0,1], \mathcal{C}_\rho, \rho)$ is a probability triple. The set
of continuous functions on $[0,1]$ is $C[0,1]$; $\rho$ is the sup-norm; and
$\mathcal{C}_\rho$ is the $\sigma$-field generated by the open balls.
There are several aspects of Theorem 3.2.3 which merit attention. First is that
the limiting process is basically a kernel smoothing of the process $L(u)$. The
second is that the result refers to fixed bandwidths. This is in contrast to
standard kernel density estimation results which require the bandwidth to shrink
to zero. However, one should note that although some work has been done, it is
also not typical in ordinary kernel density estimation to treat the estimator as
a stochastic process and to investigate its weak convergence. Bickel and
Rosenblatt (1973) have done the major work in this area. They assume the data
are iid according to a density $f$ and that $f$ satisfies various conditions,
such as being twice differentiable with a bounded second derivative. They then
treat the kernel density estimate on a bounded interval as a process and obtain
weak convergence results under a shrinking bandwidth.

There are several arguments that can be employed to justify fixing the
bandwidth in Theorem 3.2.3. The rationale in any context for letting the
bandwidth tend toward zero is to remove the bias of the estimator. In terms of
testing, the comparison density is uniform under the null hypothesis. It is not
hard to show that the estimator, $\hat d_h(w)$, is asymptotically unbiased under
$H_0$ for fixed bandwidths. Observe:

$$E[\hat d_h(w)] = \frac{1}{mh}\sum_{i=1}^{m} E\,K_s\!\left(\frac{w-R_i/N}{h}\right)
 = \frac{1}{Nh}\sum_{j=1}^{N} K_s\!\left(\frac{w-j/N}{h}\right)$$

since each $R_i$ is marginally uniform over $\{1, \ldots, N\}$ under $H_0$.
Hence,

$$E[\hat d_h(w)] \to \frac{1}{h}\int_0^1 K_s\!\left(\frac{w-u}{h}\right) du = 1$$

as $N \to \infty$. Thus $\hat d_h$ is asymptotically unbiased for fixed $h$.
From an implementation standpoint, knowing results like
$(m \wedge n)h^3 \to \infty$ is not much help in choosing a bandwidth. Recall
the discussion of Subsection 2.4.6. One is either left to judge the fit by
graphical standards or by a criterion such as least squares cross-validation.
Essentially, $h$ controls the tradeoff between bias and variance. Letting
$h \to 0$ may make things work out asymptotically, but is of little help in
fixed samples. Finally, one can argue that the asymptotic approximation derived
for fixed $h$ is superior to that derived for $h \to 0$. To this end, an
examination of the exact differences in the limiting distributions should prove
useful. Under $H_0$, each has the same limiting mean but the variances are not
the same. The variance for $h \to 0$ is given in Theorem 3.2.1; a variance
formula from Theorem 3.2.3 needs to be derived.

Let $C_h(v,w)$ be the covariance kernel of $\mathrm{KDP}_h$. The covariance
kernel is defined as $C_h(v,w) = E[\mathrm{KDP}_h(v)\,\mathrm{KDP}_h(w)]$, which
has support on the unit square. The covariance kernel of $\mathrm{KDP}_h(w)$
under $H_0$ can be derived and the result is given as Lemma 3.2.3.

Lemma 3.2.3. The covariance kernel of
$\sqrt{\lambda_0/(1-\lambda_0)}\,\mathrm{KDP}_h(w)$ under $H_0$ is

$$C_h(v,w) = \frac{1}{h^2}\int_0^1 K_s\!\left(\frac{v-u}{h}\right)
 K_{s'}\!\left(\frac{w-u}{h}\right) du - 1, \tag{3.2.3}$$

where $s = s(v,h)$ and $s' = s(w,h)$.

Although the formula for $C_h(v,w)$ looks somewhat messy, it is possible to
obtain a closed form expression for it. No numerical integration is necessary.
This formula is derived in Appendix B.
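For readers without the closed form at hand, the null covariance kernel can
still be explored numerically. The Python sketch below is an illustration
under stated assumptions: it reads (3.2.3) as
$C_h(v,w) = h^{-2}\int_0^1 K_s[(v-u)/h]\,K_{s'}[(w-u)/h]\,du - 1$, uses the
biweight kernel, evaluates the integral by a plain trapezoid rule rather than
the closed form of Appendix B, and is restricted to interior points
($h \le v, w \le 1-h$) because the boundary kernels are not implemented. The
name `cov_interior` is hypothetical.

```python
import numpy as np


def biweight(t):
    """Biweight (quartic) kernel with support [-1, 1]."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= 1.0, 15.0 / 16.0 * (1.0 - t**2) ** 2, 0.0)


def cov_interior(v, w, h, grid=20001):
    """Null covariance kernel C_h(v, w) at interior points, by trapezoid rule."""
    if min(v, w) < h or max(v, w) > 1.0 - h:
        raise ValueError("interior points only; boundary kernels not implemented")
    u = np.linspace(0.0, 1.0, grid)
    f = biweight((v - u) / h) * biweight((w - u) / h)
    du = u[1] - u[0]
    integral = du * (f.sum() - 0.5 * (f[0] + f[-1]))   # trapezoid rule
    return integral / h**2 - 1.0


far_apart = cov_interior(0.3, 0.7, 0.1)   # supports do not overlap
diagonal = cov_interior(0.5, 0.5, 0.2)    # variance at an interior point
```

When the two points are more than $2h$ apart the kernel supports do not
overlap and the covariance sits at $-1$; on the diagonal the value is
$\frac{1}{h}\int K(t)^2\,dt - 1$, which for the biweight is $5/(7h) - 1$.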

From the definition of the covariance kernel, it is obvious that
$C_h(v,w) = C_h(w,v)$. Recall the relation of the boundary kernel for the left
and right endpoints from Subsection 2.4.3, $K^r_s(t) = K^l_s(-t)$. Using this
relation and a change of variable in (3.2.3) it can be shown that $C_h(v,w)$
also satisfies

$$C_h(v,w) = C_h(1-v,\, 1-w).$$

These symmetries are visible in Figures 8 through 10. Each of these figures
presents four perspective plots of the covariance kernel under $H_0$. Figure 8
pictures $C_h(v,w)$ for $h = 0.5$; Figure 9 for $h = 0.3$; and Figure 10 for
$h = 0.1$. The graphs are truncated at $\pm 3$ so that details are not obscured
by a large dynamic range. In each figure, the covariance changes from its base
level of $-1$ only if the two points are within two bandwidths of one another.
For a large bandwidth
(Figure  8,  h  =  0.5),  one  observes  a  very  smooth  tunnel-like  appearance.  For  a 


Fig. 8. Perspective plots of the covariance kernel of the kernel density process
under $H_0$. The bandwidth is $h = 0.5$. The perspective of Figure (a) is
(5, -4, 10); Figure (b) is (.5, -4, 10); Figure (c) is (-2, -3, 10); and
Figure (d) is (0, -.5, 0).


Fig. 9. Perspective plots of the covariance kernel of the kernel density process
under $H_0$. The bandwidth is $h = 0.3$. The perspective of Figure (a) is
(5, -4, 10); Figure (b) is (.5, -4, 10); Figure (c) is (-2, -3, 10); and
Figure (d) is (0, -.5, 0).
There are two distinctions between (3.2.4) and (3.2.5). For $w$ near the
boundary, (3.2.5) uses the boundary kernel whereas (3.2.4) does not. Equation
(3.2.5) also has an additional term. Since (3.2.5) approaches (3.2.4) as $h$
tends to zero, one could easily view (3.2.5) as containing correction factors
for fixed $h$. Since $h$ is fixed for finite samples, one might suppose that
(3.2.5) would yield an improved approximation of the small sample distribution
function by the limiting distribution function. A simulation confirms this
supposition.

The simulation consists of 1000 replications for each of five choices of $m$
and $n$. The two samples are each drawn from a $U(0,1)$ population, hence $H_0$
is true. The bandwidth is given by $h = 0.75(m \wedge n)^{-0.3}$, which
satisfies the rate conditions of Theorem 3.2.1. For each replication, the
comparison density estimate is found for $w = 0.1, 0.2, 0.3, 0.4, 0.5$. Values
above 0.5 are not needed by symmetry with those below 0.5. The one sample
Kolmogorov-Smirnov statistic comparing the fit of the data to each normal
approximation is then calculated for each sample. The Kolmogorov-Smirnov
statistic is a measure of the goodness of fit of the sample and asymptotic
distributions. The results are presented as Table 3. The Kolmogorov-Smirnov
statistics based on the fixed bandwidth approximation are less than those for
shrinking bandwidth in all but one case, $n = m = 100$ and $w = 0.1$. The
difference in this case is at the second decimal place and is certainly
statistically insignificant. The overall impression from Table 3 is that the
Kolmogorov-Smirnov values for the fixed $h$ approximation are substantially
smaller than those for shrinking $h$. The lower values for the fixed $h$
approximation would imply that this is the superior approximation.
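The direction of this finding can be made plausible without simulation, at
least for the variance at an interior point. Under $H_0$ the ranks of the first
sample are a simple random sample without replacement from $\{1, \ldots, N\}$,
so the exact variance of $\hat d_h(w)$ is given by the usual finite population
formula. The Python sketch below compares that exact variance with two
approximations. It is a hedged illustration, not the simulation reported in
Table 3: the biweight kernel is assumed, the shrinking bandwidth variance is
taken to be $\frac{1-\lambda}{\lambda N h}\int K^2$ and the fixed bandwidth
variance to be $\frac{1-\lambda}{\lambda N}\left[\frac{1}{h}\int K^2 - 1\right]$
(the interior diagonal of (3.2.3)); these readings, and the name `variances`,
are assumptions of the sketch.

```python
import numpy as np

INT_K2 = 5.0 / 7.0  # integral of the squared biweight kernel over [-1, 1]


def biweight(t):
    """Biweight (quartic) kernel with support [-1, 1]."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= 1.0, 15.0 / 16.0 * (1.0 - t**2) ** 2, 0.0)


def variances(m, n, w, h):
    """Exact null variance of the estimate at interior w, and two approximations.

    Returns (exact, fixed_h_approximation, shrinking_h_approximation).
    """
    n_total = m + n
    # Score attached to each possible rank j = 1..N.
    scores = biweight((w - np.arange(1, n_total + 1) / n_total) / h) / h
    pop_var = np.mean(scores**2) - np.mean(scores) ** 2
    # Variance of the mean of m scores drawn without replacement from N.
    exact = pop_var * (n_total - m) / (m * (n_total - 1))
    lam = m / n_total
    scale = (1.0 - lam) / (lam * n_total)
    shrinking = scale * INT_K2 / h
    fixed = scale * (INT_K2 / h - 1.0)
    return exact, fixed, shrinking


m, n = 20, 20
h = 0.75 * min(m, n) ** -0.3
exact, fixed, shrinking = variances(m, n, w=0.5, h=h)
```

For $m = n = 20$ the bandwidth is roughly 0.31, and dropping the $-1$ term
inflates the approximate variance by a large fraction, which is consistent
with the poorer fit of the shrinking bandwidth approximation.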

The  conclusion  to  draw  from  these  remarks  is  that  there  are  very  good 
reasons  to  derive  asymptotic  results  for  fixed  rather  than  shrinking  bandwidths. 
The  kernel  density  process  forms  the  basis  of  tests  of  the  null  hypothesis.  These 
tests  are  discussed  in  Subsection  3.3. 

Table 3

Comparison of the fit of the small sample distribution of the
Gasser-Müller boundary kernel estimate of the comparison
density under $H_0$ to its limiting distribution with fixed and
shrinking bandwidth. The values in the table are one sample
Kolmogorov-Smirnov statistics.

                              w
    m     n     0.1    0.2    0.3    0.4    0.5

  Fixed Bandwidth
   20    20    3.45   1.60   0.74   0.99   0.78
   50    20    3.70   1.57   0.71   1.43   1.23
   50    50    1.52   0.81   1.27   0.61   1.17
   50   100    1.30   0.70   0.56   1.05   1.42
  100   100    1.73   0.67   1.00   0.79   0.60

  Shrinking Bandwidth
   20    20    4.31   2.19   2.03   1.90   1.97
   50    20    4.39   2.52   2.39   2.05   2.26
   50    50    1.65   2.08   1.93   1.81   1.88
   50   100    1.39   1.57   1.52   1.98   1.96
  100   100    1.70   1.82   1.94   1.14   1.72

3.3. Tests of the Null Hypothesis

3.3.1. Introduction. In this subsection, tests of the null hypothesis,
$H_0: d_\lambda(w) = 1$, are examined. The first test looked at is based on a
statistic, $\varphi^2_{N,h}$, which is a scaled version of the square of the
$\mathcal{L}_2$ norm of the difference between $\hat d_h$ and 1. Since
$\varphi^2_{N,h}$ is a statistic, it does not fit the criteria required for a
testing procedure. However, it leads to the concept of components similar to
those discussed in Subsection 2.2.5. These components form the basis of the
testing procedure. They are investigated in depth in Subsection 3.3.3. Their
properties are explored and numerical methods for calculating the eigenfunctions
and eigenvalues upon which they are based are examined. Subsection 3.3.4 looks
at the question of whether the eigenfunctions form a complete basis for the
spaces in which $\mathrm{KDP}_{N,h}$ and $\mathrm{KDP}_h$ reside. This raises
issues in terms of orthogonal decompositions of $\hat d_h$ by the
eigenfunctions. Subsection 3.3.5 introduces a new test, the subset chi-square
test, which is applied to the components in Subsection 3.3.6. Subsection 3.3.6
also introduces an orthogonal series estimator based on the components. The
relation of the boundary kernel estimator and this orthogonal series estimator
is investigated. Subsection 3.3.7 provides recommendations for the choice of
bandwidth. Finally, Subsection 3.3.8 summarizes a unified technique of
estimation and testing which satisfies all the required criteria.

3.3.2. The Statistic $\varphi^2_{N,h}$. In this subsection, the statistic
$\varphi^2_{N,h}$ is defined. Its limiting distribution is found and a possible
representation for this limiting distribution in terms of components is also
examined. Define $\varphi^2_{N,h}$ by

$$\varphi^2_{N,h} = \frac{N\lambda_{(N)}}{1-\lambda_{(N)}}
 \int_0^1 \left[\hat d_h(w) - 1\right]^2 dw,$$

which has already been briefly mentioned in Subsection 3.2.2. Note that 1 is
subtracted from $\hat d_h$ and not $d_h$, the smoothed comparison density of
Subsection 3.2.3. The general statistic is not of interest because it is
desired to test the null hypothesis and so the mean under $H_0$ is subtracted.
This will be the case throughout this subsection. Although the general weak
convergence of $\mathrm{KDP}_{N,h}$ was shown in Subsection 3.2.3, for
construction of tests one only needs the weak convergence under $H_0$. For
clarity, the process $\mathrm{KDP}_{0N,h}$, which is defined as

$$\mathrm{KDP}_{0N,h}(w) = \sqrt{N}\left[\hat d_h(w) - 1\right],$$

is introduced and will be referred to as the null kernel density process. The
process $\mathrm{KDP}_{0N,h}$ equals $\mathrm{KDP}_{N,h}$ under $H_0$ and so
converges weakly under $H_0$. Under alternatives there is no such result: the
process will become unbounded as $N$ increases because it is incorrectly
centered. This too is desirable as it indicates that the null hypothesis is
false. It is now possible to rewrite $\varphi^2_{N,h}$ as

$$\varphi^2_{N,h} = \frac{\lambda_{(N)}}{1-\lambda_{(N)}}
 \int_0^1 \mathrm{KDP}_{0N,h}(w)^2\,dw.$$

The statistic $\varphi^2_{N,h}$ is a normalized estimate of Pearson's
$\varphi^2$ distance measure, which can be written [cf. Eubank, LaRiccia, and
Rosenstein (1987)] as

$$\varphi^2 = \int_0^1 \left[d_{(N)}(w) - 1\right]^2 dw.$$

The initial claim for $\varphi^2_{N,h}$ is that it converges in distribution to
the random variable

$$\varphi^2_h = \frac{\lambda_0}{1-\lambda_0}\int_0^1 \mathrm{KDP}_h(w)^2\,dw$$

under $H_0$. This convergence in distribution can be demonstrated by results for
functionals of stochastic processes which are known to converge weakly. It is
not hard to show this functional is continuous [see Ruymgaart (1988), page 54];
it is also measurable $(\mathcal{C}_\rho, \mathcal{B})$, where $\mathcal{B}$ is
the Borel $\sigma$-field. Theorem 3.11 of Ruymgaart may be used to establish the
fact that $\varphi^2_{N,h} \xrightarrow{d} \varphi^2_h$. Having found the
limiting random variable, $\varphi^2_h$, its distribution needs to be
established.

The distribution of quantities such as $\varphi^2_h$ has been examined in the
literature. Its behavior is similar to that of the Cramér-von Mises statistic
defined in Subsection 2.2.5. One would like to apply a technique similar to that
applied by Durbin and Knott (1972) to the Cramér-von Mises statistic. That is,
one would like to represent $\varphi^2_h$ by

$$\varphi^2_h = \sum_{j=1}^{\infty} \theta_j Z_j^2, \tag{3.3.1}$$

where $Z_1, Z_2, \ldots$ are iid $N(0,1)$ random variables and $\{\theta_j\}$
satisfies $\theta_j > 0$ and $\sum_j \theta_j < \infty$. The details of the
basis of this representation can be found in Shorack and Wellner (1986). The
$Z_j$'s are known as components of the null kernel density process and the
$\theta_j$'s will be seen to be eigenvalues of $C_h(v,w)$. The details of the
construction of the components are not needed here but will be discussed in
Subsection 3.3.3. The exact meaning of the "=" in equation (3.3.1) is subject to
question. Under one set of conditions, it refers to 'distributed as'. However,
under another set of conditions, the definitions of $\mathrm{KDP}_{0N,h}$ and
$\mathrm{KDP}_h$ need to be modified to be their projection onto an appropriate
subspace. This projection is then substituted for the original process and the
results hold. For the moment, these concepts and definitions are left
intentionally vague. They will be discussed in depth in Subsection 3.3.4.

The statistic $\varphi^2_{N,h}$ motivates the introduction of components as a natural
consequence of investigating the distribution of the limiting random variable. It
also motivates a detailed study of $C_h(v,w)$. From a practical standpoint the two
interpretations do not matter much. However, which scenario holds will change
certain interpretations and wordings. In the next subsection, the components
will be seen to be of greater interest than the statistic which initially motivates
them.

3.3.3. Components of the Kernel Density Process. In this subsection the
components of the kernel density process which were introduced above are defined.
The basic properties of $C_h(v,w)$ are given first as they will be needed
throughout. In order to define the components it is necessary to find the
eigenfunctions and eigenvalues of $C_h(v,w)$. It is observed that the eigenfunctions and
eigenvalues must be found numerically. A method for doing so is suggested. The
resulting approximations are examined graphically. Finally, various interpretations
of the components are explored. They are seen to be both generalized
Fourier coefficients and linear rank statistics.

Before starting on a discussion of eigenfunctions, eigenvalues, and components,
a few of the properties of $C_h(v,w)$ are given. These will be needed throughout
to establish the properties of these other objects of interest. These properties
of $C_h(v,w)$ are stated as Lemma 3.3.1.

Lemma 3.3.1. The covariance kernel $C_h(v,w)$ satisfies the following:

i. $C_h(v,w)$ is continuous on the unit square,

ii. $\int_0^1 \int_0^1 C_h(v,w)^2\,dv\,dw < \infty$,

iii. $\int_0^1 C_h(v,v)\,dv < \infty$.

The function $\phi^h(v)$, which is defined on [0,1], is said to be an eigenfunction
of $C_h(v,w)$ and $\theta^h$ is said to be the associated eigenvalue if $\phi^h(v) \not\equiv 0$ and

$$\int_0^1 \phi^h(v)\,C_h(v,w)\,dv = \theta^h \phi^h(w),$$

for all $w \in [0,1]$. Shorack and Wellner (1986), page 207, give a list of results for
eigenfunctions and eigenvalues. Among these are:

1.  The  eigenvalues  are  at  most  countable  in  number. 

2.  Corresponding  to  any  non-zero  eigenvalue  there  are  at  most  a  finite  number 
of  linearly  independent  eigenfunctions;  the  maximal  such  number  is  called 
the  multiplicity  of  the  eigenvalue. 

3.  Let $\{\theta_j^h\}$ be an enumeration of the nonzero eigenvalues with each eigenvalue
appearing as many times as its multiplicity. Then the set $\{\phi_j^h(v)\}$ of
eigenfunctions may be assumed to be orthonormal.

4.  The eigenvalues $\{\theta_j^h\}$ satisfy $\int_0^1 C_h(v,v)\,dv = \sum_j \theta_j^h$.

The properties of $C_h(v,w)$ given in Lemma 3.3.1 imply several additional ones
involving both $C_h(v,w)$ and the eigenfunctions. These are given as Lemma 3.3.2.

Lemma 3.3.2. The eigenfunctions, $\phi_j^h(w)$, and the null covariance kernel,
$C_h(v,w)$, satisfy:

i. The eigenfunctions, $\phi_j^h(w)$, are continuous on [0,1],

ii. $C_h(v,w)$ is positive semi-definite,

iii. $C_h(v,w) = \sum_{j=1}^{\infty} \theta_j^h \phi_j^h(v) \phi_j^h(w)$, where the infinite series converges both
absolutely and uniformly.


The covariance kernel, $C_h(v,w)$, is said to be positive semi-definite if

$$\int_0^1 \int_0^1 g(v)\,C_h(v,w)\,g(w)\,dv\,dw \ge 0$$

for all $g \in L^2[0,1]$ with $g \not\equiv 0$. It is said to be positive definite if the $\ge 0$
can be replaced by $> 0$. These properties will all be of use at one point or
another. For now the discussion turns to actually calculating the eigenfunctions
and eigenvalues.

The form of $C_h(v,w)$ as given in equation (3.2.3) involves the boundary
kernel in a very complicated way. It is probably too much to expect to be able to
derive analytic expressions for the eigenfunctions and eigenvalues. This is indeed
the case: a numerical solution is needed. The route taken is to approximate the
defining integral by Simpson's rule and to convert the problem to an ordinary
matrix eigenvalue problem. Such discretized approximations are well known in
the literature; see, for example, Ahués, d'Almeida, Chatelin and Telias (1982).

It is desired to find $\phi_j^h(v)$ and $\theta_j^h$ to solve the integral equation

$$\int_0^1 \int_0^1 \phi_j^h(v)\,C_h(v,w)\,\phi_i^h(w)\,dv\,dw = \theta_j^h \delta_{ij}, \tag{3.3.2}$$

where $\delta_{ij}$ is Kronecker's delta. Equation (3.3.2) is approximated by a two dimensional
Simpson's rule at the points $x' = (0, 1/l, 2/l, \ldots, l/l)$ with $l$ even. Define

$$d' = (1, 4, 2, 4, \ldots, 2, 4, 1),$$

$$v_{hj}' = (\phi_j^h(0), \phi_j^h(1/l), \ldots, \phi_j^h(l/l)),$$

and

$$K_h = [C_h((i-1)/l, (j-1)/l)]_{(l+1) \times (l+1)}.$$

The Simpson's rule approximation to equation (3.3.2) is

$$\frac{1}{9l^2}\, v_{hj}' D K_h D v_{hi} = \theta_j^h \delta_{ij}, \quad i,j = 1, 2, 3, \ldots,$$

where $D = \mathrm{diag}(d)$. Grouping these together for $i,j = 1, \ldots, l+1$ yields

$$\frac{1}{9l^2}\, V_h' D K_h D V_h = \Theta_h,$$

where $\Theta_h = \mathrm{diag}(\theta_1^h, \ldots, \theta_{l+1}^h)$ and $V_h = [v_{h1}, \ldots, v_{h,l+1}]$. There is also an
orthogonality condition on the $\phi$'s; namely

$$\int_0^1 \phi_i^h(w)\,\phi_j^h(w)\,dw = \delta_{ij}, \quad i,j = 1, 2, 3, \ldots$$

The orthogonality condition is approximated by

$$\frac{1}{3l}\, v_{hi}' D v_{hj} = \delta_{ij}, \quad i,j = 1, 2, 3, \ldots,$$

so in order to approximate the first $l+1$ eigenfunctions and eigenvalues, the
following system must be solved for $V_h$ and $\Theta_h$:

$$\frac{1}{9l^2}\, V_h' D K_h D V_h = \Theta_h,$$

$$\frac{1}{3l}\, V_h' D V_h = I_{l+1}.$$

Letting $S_h = (3l)^{-1/2} D^{1/2} V_h$, this system is equivalent to

$$S_h' \left[ \frac{1}{3l} D^{1/2} K_h D^{1/2} \right] S_h = \Theta_h,$$

$$S_h' S_h = I_{l+1}.$$

This last system is an ordinary symmetric eigenvalue problem which can be
handled numerically. One solves it for $S_h$ and $\Theta_h$ and then finds $V_h$ from $S_h$.
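
The discretization above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the routine used to produce the figures: it assumes NumPy, the function names are hypothetical, and because the covariance kernel of equation (3.2.3) is not reproduced in this subsection, the Brownian bridge covariance min(v, w) - vw is substituted as a stand-in. That kernel has known eigenvalues 1/(j pi)^2 and trace 1/6, which makes the output easy to check.

```python
import numpy as np


def simpson_weights(l):
    """Simpson pattern d' = (1, 4, 2, 4, ..., 2, 4, 1) on l + 1 points (l even)."""
    if l < 2 or l % 2 != 0:
        raise ValueError("l must be a positive even integer")
    d = np.ones(l + 1)
    d[1:-1:2] = 4.0
    d[2:-1:2] = 2.0
    return d


def simpson_eigen(cov, l):
    """Approximate eigenvalues and eigenfunctions of a covariance kernel on [0, 1].

    Solves S'[(1/(3l)) D^{1/2} K D^{1/2}]S = Theta with S'S = I, then
    recovers V = (3l)^{1/2} D^{-1/2} S, whose columns hold the
    eigenfunction values at the grid points 0, 1/l, ..., 1.
    """
    x = np.arange(l + 1) / l
    d = simpson_weights(l)
    K = cov(x[:, None], x[None, :])           # K[i, j] = C(x_i, x_j)
    root_d = np.sqrt(d)
    A = root_d[:, None] * K * root_d[None, :] / (3.0 * l)
    theta, S = np.linalg.eigh(A)              # eigh returns ascending order
    order = np.argsort(theta)[::-1]           # largest eigenvalue first
    theta, S = theta[order], S[:, order]
    V = np.sqrt(3.0 * l) * S / root_d[:, None]
    return x, d, theta, V


def brownian_bridge_cov(v, w):
    """Stand-in kernel (an assumption, not C_h): min(v, w) - v * w."""
    return np.minimum(v, w) - v * w


if __name__ == "__main__":
    l = 88
    x, d, theta, V = simpson_eigen(brownian_bridge_cov, l)
    exact = 1.0 / (np.pi * np.arange(1, 5)) ** 2
    print("estimated eigenvalues:", theta[:4])
    print("exact eigenvalues    :", exact)
    # Trace check: the eigenvalue sum should match the integral of C(v, v).
    print("sum of eigenvalues   :", theta.sum(), "versus 1/6 =", 1.0 / 6.0)
```

Substituting a function that evaluates $C_h(v,w)$ for `brownian_bridge_cov` would give the approximation described in the text; the sign of each column of `V` is arbitrary, as it is for any eigenvector routine.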

Since the covariance kernel is so nicely behaved, one expects Simpson's rule
to perform well in this case. One would also expect the eigenvalues to be better
estimated than the eigenfunctions; the former are simple scalars whereas the
latter are functions. This, however, is a much more desirable state than the
alternative. The eigenfunctions appear only in integrals where individual error
is damped down, whereas it is often necessary to divide by the eigenvalues in
which case error could produce large effects. In considering a choice for $l$, one
should try to make it rather larger than the highest order eigenfunction one is
considering using. Otherwise, the approximation may be at too few points to get
a good fix on the function.


Applying this technique and using the biweight kernel and $l = 88$, Figures
11 through 13 are produced. Each figure presents, for a different h, the first four
approximated eigenfunctions. Figure 11 is constructed with h = 0.5; Figure 12
with h = 0.3; and Figure 13 with h = 0.1. Generally, the jth eigenfunction has
a similar shape for each bandwidth up to an arbitrary sign. Changing the bandwidth
tends to change the sharpness or peakedness of the functions. Notice that
they are oscillatory; the jth eigenfunction has j zero crossings in (0,1). Figure
14 presents the estimated eigenvalues for these three bandwidths. The values all
decay to zero, as they must since their sum is finite. The larger the bandwidth, the
more quickly they decay to zero.

As a check on how well the numerical approximation works, one can compare
the sum of the estimated eigenvalues with $\int_0^1 C_h(v,v)\,dv$. These two values should
be comparable. The sum is truncated, but considering the rate at which the
eigenvalues are decreasing this effect should be very small. Table 4 presents
this comparison for five bandwidths, h = 0.5, 0.4, 0.3, 0.2, and 0.1. The true
value is computed using Simpson's rule at 1001 points. The estimated values
are astonishingly accurate. Interestingly, the estimated sum tends to err on
the side of being slightly too big. The largest relative error is less than 0.03%,
though it would be very slightly larger if one could add in the truncated values.
Nevertheless, such good results are very encouraging. They lend credence to the
approximating procedure as a whole.

Fig. 13. The first four approximated eigenfunctions for h = 0.1 and the Gasser-Müller
boundary modification to the biweight kernel. Figure (a) is the first eigenfunction;
Figure (b) the second; Figure (c) the third; and Figure (d) the fourth.

Fig. 14. The first 20 estimated eigenvalues for h = 0.5, 0.3, and 0.1 and the
Gasser-Müller modification to the biweight kernel. The solid line with large x's
is h = 0.5; the solid line with blocks is h = 0.3; and the solid line with small x's
is h = 0.1.

Table 4

Comparison of the sum of the estimated
eigenvalues and their true sum.

            Sum of Eigenvalues
   h        True      Estimated
  0.5      1.9074      1.9074
  0.4      2.2646      2.2646
  0.3      2.8598      2.8598
  0.2      4.0503      4.0504
  0.1      7.6217      7.6234
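
The relative errors implied by Table 4 can be recomputed directly from the tabulated values. The short sketch below does only that arithmetic; the dictionary is a transcription of the table (values rounded to four decimals as printed), and the function name is hypothetical.

```python
# Values transcribed from Table 4: bandwidth h -> (true sum, estimated sum).
table4 = {
    0.5: (1.9074, 1.9074),
    0.4: (2.2646, 2.2646),
    0.3: (2.8598, 2.8598),
    0.2: (4.0503, 4.0504),
    0.1: (7.6217, 7.6234),
}


def relative_errors(table):
    """Return {h: (estimated - true) / true} for each bandwidth."""
    return {h: (est - true) / true for h, (true, est) in table.items()}


if __name__ == "__main__":
    for h, err in relative_errors(table4).items():
        print(f"h = {h:.1f}: relative error = {100.0 * err:.4f}%")
```

Because the inputs are rounded, the printed percentages are only accurate to about the last tabulated digit.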

The components of the null kernel density process, $\mathrm{KDP0}_{N,h}$, are defined as

$$Z_{Nj} = \int_0^1 \phi_j^h(w)\,[\hat{d}_{N,h}(w) - 1]\,dw,$$

for $j \ge 1$ and the components of the limiting process are defined as

$$Z_j = \sqrt{\frac{\lambda_0}{(1-\lambda_0)\,\theta_j^h}} \int_0^1 \phi_j^h(w)\,\mathrm{KDP}_h(w)\,dw$$

for $j \ge 1$. Although the components clearly depend on h, it is not included in
the notation for simplicity. By Lemma 3.3.1 and Proposition 2 of Shorack and
Wellner (1986), page 208, one can conclude that $Z_1, Z_2, \ldots$ are iid N(0,1) random
variables under H0. Since the functional defining the components is continuous
(as a result of Lemma 3.3.2) and is measurable $(C_p, \mathcal{B})$, by Theorem 3.11 of
Ruymgaart (1988) and Theorem 3.2.3, one has


$$Z_{Nj} \xrightarrow{d} Z_j$$

under H0 as $m \wedge n \to \infty$.

It has just been shown that the sample components converge in distribution
singly to the appropriate limiting component. Later, the joint convergence in
distribution of the set $Z_{N1}, \ldots, Z_{NM}$ will be needed. This result is easily derived
by an application of the Cramér-Wold device and an appeal to the same theorems
used above. Formally, Lemma 3.3.3 states the result.

Lemma 3.3.3. For any fixed integer $M \ge 1$ and fixed bandwidth h, one has
$(Z_{N1}, \ldots, Z_{NM}) \xrightarrow{d} (Z_1, \ldots, Z_M)$ as $m \wedge n \to \infty$.

The components have several very important interpretations. First, $Z_{Nj}$
is the jth generalized Fourier coefficient in an expansion of $\hat{d}_{N,h}(w) - 1$ in the
eigenfunctions. This interpretation will be significant in a later subsection where
an orthogonal series estimator of the comparison density is based on the eigenfunctions. Second,
$Z_{Nj}$ is a linear rank statistic. This can be seen as follows:

$$Z_{Nj} = \int_0^1 \phi_j^h(w)\,[\hat{d}_{N,h}(w) - 1]\,dw$$

$$= \int_0^1 \left[ \int_0^1 \frac{1}{h} K^s\!\left(\frac{w-u}{h}\right) \phi_j^h(w)\,dw \right] dD_N(u) - \int_0^1 \phi_j^h(w)\,dw,$$

where $s = s(w,h)$. This last quantity has the form of a linear rank statistic with
the score function

$$J_j(u) = \int_0^1 \frac{1}{h} K^s\!\left(\frac{w-u}{h}\right) \phi_j^h(w)\,dw.$$

This score function can be termed a 'backward' smoothing of $\phi_j^h(w)$. The term
backward is applied because the integration is with respect to w and not u as is
usual. The term $\int_0^1 \phi_j^h(w)\,dw$ is a centering constant and is equal to $\int_0^1 J_j(u)\,du$.
For rank statistics, the centering constant arises naturally by centering $D_N(u)$
by the appropriate function, which under H0 is u. This converts $D_N(u)$ to
$D_N(u) - u$ which is a multiplicative constant ($\sqrt{N}$) away from being the empirical
comparison distribution process, $\mathrm{CD}_N$, under H0, that is,

$$\sqrt{N}\,Z_{Nj} = \int_0^1 J_j(u)\,d\mathrm{CD}_N(u)$$

$$= \sqrt{N} \int_0^1 J_j(u)\,d[D_N(u) - u]$$

$$= \sqrt{N} \left[ \int_0^1 J_j(u)\,dD_N(u) - \int_0^1 J_j(u)\,du \right].$$

Figures  15  through  17  picture  these  score  functions.  In  order  to  make  them 
easier  to  view,  they  have  been  scaled  so  that  each  attains  a  maximum  absolute 
value  of  1.  The  important  aspect  of  score  functions  is  their  shape,  not  their  mag¬ 
nitude.  Each  figure  presents  four  graphs  corresponding  to  j  =  1,2, 3, 4.  Figure  15 
employs  a  bandwidth  of  h  =  0.5;  Figure  16  uses  h  =  0.3;  and  Figure  17  employs 
h  =  0.1.  The  property  of  the  number  of  zero  crossings  of  the  eigenfunctions  is 
preserved  by  the  score  functions.  Recalling  the  discussion  of  score  functions  in 
Subsection  2.2.4,  the  first  component  is  seen  to  test  location  and  the  second  scale. 
Higher order components are testing higher frequency departures of the comparison density from
uniformity.  Although  it  would  be  nice  to  give  these  higher  frequency  departures 
moment  interpretations  such  as  skewness  and  kurtosis,  such  interpretations  have 
not  been  demonstrated. 

Score functions based on eigenfunctions and boundary kernels are entirely
novel: this is a new procedure for generating score functions. There are several
attractive features to this methodology. First, one is generating an entire family
of score functions, starting with location, moving to scale, and then higher
order departures. There is a link between these since they have a unified origin.
Portmanteau tests for departures up to the fourth order score function, (i.e.
j = 4), have been proposed in the literature [see Boos (1986)]. However, the
score functions employed have no common origin making the entire procedure
seem somewhat ad hoc. Second, these score functions are parametrically defined
by the bandwidth, h. Selecting the bandwidth allows one to select the properties
of the test, that is, one can tune the bandwidth so that the components are more
powerful against certain classes of underlying distributions. Eubank, LaRiccia,
and Rosenstein (1987) provide a unified origin for score functions by taking them
to be an orthonormal basis. There are several reasons why an approach based
on eigenfunctions is desirable. First, to protect against different classes of
distributions requires a complete change of the basis. Such a change of basis may
have important implications for any estimator of the comparison density based
on it (cf. Subsection 2.4.4). Second, the eigenfunction-based scores have quite
unusual shapes that would be hard to match by standard orthogonal functions;
for instance, some of them put more weight on the tails than would be observed for
the Legendre polynomials or trigonometric functions. Often, the tails are precisely
the area of interest. Thus, the eigenfunction approach generates interesting
shapes that would be difficult to obtain otherwise. There is great convenience
and theoretical unity in a procedure based on the eigenfunctions.

Fig. 15. The first four score functions corresponding to the first four components
of the kernel density process based on the Gasser-Müller boundary modification
to the biweight kernel. The bandwidth is h = 0.5. The score functions have been
scaled so that the maximum absolute value attained is 1. Figure (a) is the first
score function; Figure (b) the second; Figure (c) the third; and Figure (d) the
fourth.

Fig. 16. The first four score functions corresponding to the first four components
of the kernel density process based on the Gasser-Müller boundary modification
to the biweight kernel. The bandwidth is h = 0.3. The score functions have been
scaled so that the maximum absolute value attained is 1. Figure (a) is the first
score function; Figure (b) the second; Figure (c) the third; and Figure (d) the
fourth.

Fig. 17. The first four score functions corresponding to the first four components
of the kernel density process based on the Gasser-Müller boundary modification
to the biweight kernel. The bandwidth is h = 0.1. The score functions have been
scaled so that the maximum absolute value attained is 1. Figure (a) is the first
score function; Figure (b) the second; Figure (c) the third; and Figure (d) the
fourth.

The  asymptotic  relative  efficiencies  (ARE’s)  of  the  components  relative  to 
standard  two  sample  tests  are  taken  up  in  detail  in  Section  4.  There  is  a  certain 
amount  of  ground  work  that  needs  to  be  performed  in  order  to  define  the  ARE’s 
for  the  components.  This  work  is  most  properly  done  in  Section  4.  Suffice  it  to 
say  for  now  that  altering  the  bandwidth  does  truly  affect  the  properties  of  the 
components. These score functions give one the ability to choose the tests in a
unified manner.

Rank statistics enjoy a very important invariance property. When properly
centered and scaled, it doesn't matter which sample is called the first sample.
The resulting rank statistics will have the same magnitude (but different signs).
Hence, tests based on either statistic will always reach the same conclusion.
Unfortunately, the rank statistic $Z_{Nj}$ is not properly centered in small samples
for this to be the case. As shown above, $Z_{Nj}$ is centered by its asymptotic
mean, $\int_0^1 J_j(u)\,du$. Asymptotically, everything works out fine. In finite samples,
$\int_0^1 J_j(u)\,du$ can be sufficiently different from the small sample mean that invariance
is lost. In fact, if m and n are very different, invariance may be lost to the
point that one may reach different conclusions a significant amount of the time.

To demonstrate these statements, let $U^* = \frac{1}{m} \sum_{i=1}^{m} J_j(R_i/N)$ be the rank
statistic where $R_i$ is the rank of $X_i$ in the pooled sample. Under H0,

$$E = E[U^*] = E\left[ \frac{1}{m} \sum_{i=1}^{m} J_j(R_i/N) \right] = \frac{1}{N} \sum_{i=1}^{N} J_j(i/N),$$

since each $R_i$ is marginally distributed as uniform over $1, \ldots, N$. Let

$$U = U^* - E.$$

Now, it is true that $\sqrt{N}\,(E - \int_0^1 J_j(u)\,du) \to 0$ as $N \to \infty$. This is shown as
part of the proof of Theorem 3.2.1 in Appendix B. By Lemma A on page 20
of Serfling (1980), U has the same limiting normal distribution as $Z_{Nj}$. Let
$V^* = \frac{1}{n} \sum_{i=1}^{n} J_j(S_i/N)$ where $S_i$ is the rank of $Y_i$ in the pooled sample. It
follows that $E[V^*] = E$. Let

$$V = \frac{n}{m}\,(V^* - E).$$

The random variable V has the same limiting normal distribution as U under
H0. Simple algebra yields the invariance result: U + V = 0. Instead of centering
by E, suppose $F = \int_0^1 J_j(u)\,du$ is used. In this case, the random variables U and
V no longer have mean zero but are biased. As noted above, this bias disappears
asymptotically but may be significant in finite samples. The means of U and V
are

$$E[U] = E - F,$$

$$E[V] = \frac{n}{m}\,(E - F).$$

If, say, $m \gg n$, then U will be considerably more biased than V. Comparing
each to a standard normal reference distribution, one would expect U to reject
more often than V under H0. To avoid this problem and to preserve invariance,
the small sample means will be subtracted throughout. The asymptotics are
unaffected and the small sample properties improved.
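
The invariance argument rests on the identity $m(U^* - E) + n(V^* - E) = 0$, which holds because the two sets of ranks together exhaust $1, \ldots, N$. The sketch below checks this numerically and shows that the identity fails when the asymptotic mean F is used instead. It is illustrative only: it assumes NumPy, the helper names are hypothetical, and the score function cos(pi u) is a stand-in chosen because its integral over [0,1] is exactly zero; it is not one of the $J_j$.

```python
import numpy as np


def pooled_ranks(x, y):
    """Ranks (1..N) of the x's and of the y's in the pooled sample (no ties assumed)."""
    pooled = np.concatenate([x, y])
    ranks = np.empty(pooled.size, dtype=int)
    ranks[np.argsort(pooled)] = np.arange(1, pooled.size + 1)
    return ranks[: x.size], ranks[x.size:]


def rank_means(x, y, score):
    """Return U* = mean of score(R_i / N) and V* = mean of score(S_i / N)."""
    n_total = x.size + y.size
    r, s = pooled_ranks(x, y)
    return score(r / n_total).mean(), score(s / n_total).mean()


def score(u):
    """Hypothetical score function (an assumption, not one of the J_j)."""
    return np.cos(np.pi * u)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m, n = 40, 5
    N = m + n
    x, y = rng.normal(size=m), rng.normal(size=n)
    u_star, v_star = rank_means(x, y, score)

    E = score(np.arange(1, N + 1) / N).mean()   # small sample mean under H0
    F = 0.0                                     # integral of cos(pi u) over [0, 1]

    print("small sample centering:", m * (u_star - E) + n * (v_star - E))
    print("asymptotic centering  :", m * (u_star - F) + n * (v_star - F))
```

With the small sample mean the weighted sum vanishes up to rounding error whatever the data are; with F it equals N(E - F), which does not depend on the data either and is nonzero whenever E differs from F.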

The small sample mean of $U^*$ is

$$E[U^*] = \frac{1}{N} \sum_{i=1}^{N} J_j(i/N) = \int_0^1 \phi_j^h(w) \left[ \frac{1}{N} \sum_{i=1}^{N} \frac{1}{h} K^s\!\left(\frac{w - i/N}{h}\right) \right] dw.$$

This last formula is a more tractable form for calculation. The quantity inside the
brackets is the kernel smoothing of the data $(1/N, 2/N, \ldots, N/N)$. The integral
can be evaluated by Simpson's rule at $l+1$ points ($l$ even), hence one needs
the kernel smoothing of $(1/N, 2/N, \ldots, N/N)$ at these $l+1$ points. This can be
calculated using the same routine that calculates the estimate of the comparison
density.

As noted in the introduction, it is planned to use the components as the basis
of the integrated testing and estimation procedure. The exact procedure has yet
to be introduced; however a justification for using components can be made at
this point. Even if statistics such as $\varphi^2_{N,h}$ fit the criteria outlined, the components
would still be of greater interest. A test based on components examines the first
M of them and gives each equal weight. This is in contrast to statistics such as
$\varphi^2_{N,h}$ or the Cramér-von Mises which use all the components but downweight
each successively according to the eigenvalues of the covariance kernel. This
is clear from representations such as (3.3.1) and (2.2.3). Because they employ
all the components, such statistics are consistent against any alternative. By
this is meant that the probability of rejecting H0 if H0 is false tends to 1 as
$m \wedge n \to \infty$. There is a price to be paid for this consistency, however. The
weights on the components drop off so quickly that it takes a tremendous amount
of data to detect an alternative which affects one of the higher order components
[cf. Randles and Wolfe (1979), page 383]. This will be seen to be the case in
Section 4. Such statistics begin to lose power for alternatives affecting even the
second or third component.

In  contrast,  a  procedure  which  tests  only  the  first  M  components  and  gives 
each  equal  weight  should  have  good  power  characteristics  against  alternatives 
affecting  these  components.  However,  it  will  be  inconsistent  against  alternatives 
which affect only components other than the M considered. Such a tradeoff
seems  a  reasonable  one  on  several  grounds.  First,  since  the  statistics  will  be  seen 
to  have  poor  power  against  even  low  order  components,  consistency  is  not  much 
solace.  Second,  since  M  is  under  the  control  of  the  user,  it  can  be  chosen  to 
protect  against  as  broad  a  class  as  is  felt  necessary  or  suitable.  One  also  has  the 
comfort  that  this  class  is  much  better  protected  against  than  by  the  standard 
statistics. 
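
The tradeoff described above can be illustrated with a small simulation. The sketch below is not the power study of Section 4: it assumes NumPy, treats the components as exactly N(0,1), uses the Cramér-von Mises eigenvalues 1/(j pi)^2 as a stand-in for the eigenvalues of $C_h(v,w)$, truncates the weighted statistic at 20 components, and takes critical values from a simulated null distribution. The function name and its parameters are hypothetical.

```python
import numpy as np


def rejection_rates(shift_index, shift, n_comp=20, m_first=4,
                    n_sim=20000, level=0.05, seed=0):
    """Monte Carlo power of two statistics built from N(0, 1) components.

    equal    : sum of the first m_first squared components
    weighted : sum over all n_comp components of theta_j * Z_j**2
    The weights theta_j = 1 / (j * pi)**2 are a stand-in (an assumption).
    Returns (power of equal-weight test, power of weighted test).
    """
    rng = np.random.default_rng(seed)
    theta = 1.0 / (np.pi * np.arange(1, n_comp + 1)) ** 2

    def statistics(z):
        equal = (z[:, :m_first] ** 2).sum(axis=1)
        weighted = (theta * z ** 2).sum(axis=1)
        return equal, weighted

    # Critical values from the simulated null distribution.
    null_equal, null_weighted = statistics(rng.normal(size=(n_sim, n_comp)))
    crit_equal = np.quantile(null_equal, 1.0 - level)
    crit_weighted = np.quantile(null_weighted, 1.0 - level)

    # Alternative: shift the mean of a single component.
    z_alt = rng.normal(size=(n_sim, n_comp))
    z_alt[:, shift_index - 1] += shift
    alt_equal, alt_weighted = statistics(z_alt)
    return (alt_equal > crit_equal).mean(), (alt_weighted > crit_weighted).mean()


if __name__ == "__main__":
    for j in (1, 3):
        p_equal, p_weighted = rejection_rates(shift_index=j, shift=3.0)
        print(f"shift in component {j}: equal weight {p_equal:.3f}, "
              f"eigenvalue weighted {p_weighted:.3f}")
```

Under these assumptions a shift in the third component is detected far more often by the equal-weight statistic, since the weighted statistic multiplies that component by a weight roughly nine times smaller than the first.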

In  summary,  this  subsection  has  defined  the  basic  machinery  necessary  to 
define  the  components.  Properties  of  the  covariance  kernel  of  the  null  kernel  den¬ 
sity  process  have  been  investigated.  The  eigenfunctions  and  eigenvalues  of  this 
covariance  kernel  have  been  defined  and  their  properties  explored.  A  numerical 
method  of  finding  these  was  suggested.  The  components  were  defined  and  given 
several  very  important  interpretations.  The  components  are  both  generalized 
Fourier coefficients and linear rank statistics. As linear rank statistics they are
seen to be testing successively higher frequency departures of the comparison density
from uniformity. A small sample correction to the mean of the components was suggested
to preserve invariance. Finally, a first comparison of a test based on components
versus  the  usual  type  of  portmanteau  statistic  was  made.  It  was  argued  that  one 
would  expect  the  components  to  have  superior  power  against  many  alternatives. 
This  will  be  seen  conclusively  in  Section  4. 

3.3.4.  The  Space  Spanned  by  the  Eigenfunctions.  In  this  subsection,  the 
space  spanned  by  the  eigenfunctions  of  the  null  covariance  kernel  is  examined. 
The  properties  of  decompositions  based  on  them  are  also  explored.  It  is  seen 
that  from  a  practical  aspect,  this  question  is  of  little  import.  However,  from 
a  theoretical  aspect  certain  interpretations  and  representations  do  change.  The 
question  of  exactly  what  space  the  eigenfunctions  do  span  is  not  resolved.  Given 
that  there  is  no  explicit  representation  for  the  eigenfunctions,  their  properties 
are  all  the  more  difficult  to  determine. 

In deriving representations such as (3.3.1), it is necessary to know whether
the eigenfunctions form a complete orthonormal basis for the space in which
the stochastic process resides. Since it will be necessary to decompose both
$\mathrm{KDP0}_{N,h}$ and $\mathrm{KDP}_h$, an examination of the spaces in which these processes
reside is necessary. Lemma 3.2.2 states that $\mathrm{KDP}_h$ is continuous with probability
1. A more precise statement than this is possible. The process $\mathrm{KDP}_h$ is in the
space $S_C$ with probability 1, where

$$S_C = \left\{ f : f(w) = \int_0^1 \frac{1}{h} K^s\!\left(\frac{w-u}{h}\right) g(u)\,du,\ s = s(w,h),\ g \in C[0,1] \right\}.$$

The process $\mathrm{KDP0}_{N,h}$ is in the space $S_D$ with probability 1, where

$$S_D = \left\{ f : f(w) = \int_0^1 \frac{1}{h} K^s\!\left(\frac{w-u}{h}\right) g(u)\,du,\ s = s(w,h),\ g \in D[0,1] \right\},$$

and $D[0,1]$ is the space of all functions on [0,1] which are continuous from the
right and have limits from the left. From the definitions, it follows that $S_C \subset S_D$
so that the fundamental question is whether the eigenfunctions form a complete
orthonormal basis for $S_D$. If they do, then the following two results follow:

$$\left\| \hat{d}_{N,h} - 1 - \sum_{j=1}^{M} Z_{Nj}\,\phi_j^h \right\| \xrightarrow{a.s.} 0 \text{ as } M \to \infty, \tag{3.3.3}$$

$$\varphi^2_h \stackrel{d}{=} \sum_{j=1}^{\infty} \theta_j^h Z_j^2, \tag{3.3.4}$$

where $\|\cdot\|$ is the $L^2$ norm, $\stackrel{d}{=}$ denotes equal in distribution and $\xrightarrow{a.s.}$ denotes
almost sure (with probability 1) convergence. See Shorack and Wellner (1986),
pages 270 ff., for demonstrations of these facts. Basically, (3.3.3) is a statement
of completeness and (3.3.4) is a form of Parseval's theorem.

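If the eigenfunctions are not complete for $S_D$ then equations (3.3.3) and
(3.3.4) must be amended slightly. Let $S_\phi$ be the space actually spanned by $\{\phi_j^h\}$.
Let $P\hat{d}$ be the projection of $\hat{d}_{N,h}(w) - 1$ onto $S_\phi$ and $P\mathrm{KDP}_h$ be the projection of
$\mathrm{KDP}_h$ onto $S_\phi$. Instead of (3.3.3) and (3.3.4), one has

$$\left\| P\hat{d}(w) - \sum_{j=1}^{M} Z_{Nj}\,\phi_j^h(w) \right\| \xrightarrow{a.s.} 0 \text{ as } M \to \infty, \tag{3.3.5}$$

$$\frac{\lambda_0}{1-\lambda_0} \int_0^1 P\mathrm{KDP}_h(w)^2\,dw \stackrel{d}{=} \sum_{j=1}^{\infty} \theta_j^h Z_j^2. \tag{3.3.6}$$

The results are the same, but now one must deal with the projections instead of
the original processes.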

This is inconvenient mathematically; practically it makes no difference. It
makes no difference because when (3.3.3) and (3.3.4) are actually applied, they
are truncated at some point, M. Hence, one is always dealing with the projection
onto a subspace; for instance, in Subsection 3.3.6 the orthogonal series estimate
of the comparison density,

$$1 + \sum_{j=1}^{M} Z_{Nj}\,\phi_j^h(w),$$

is introduced as a truncated decomposition of $\hat{d}_{N,h}(w) - 1$. Since M is always finite,
what does or does not happen in the tail of the sequence is of little importance.
This is particularly true since the components, $Z_{Nj}$, are becoming small with
high probability for increasing j. Note that $Z_{Nj}$ is $\mathrm{AN}(0, (1-\lambda_0)\theta_j^h / (N\lambda_0))$
under H0. Because $\theta_j^h \to 0$ as $j \to \infty$, the $Z_{Nj}$'s are becoming small with high
probability as j increases. The higher order eigenfunctions simply do not carry
much weight.
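
The effect of truncating an orthogonal expansion at M terms can be seen in a small numerical sketch. It is illustrative only and rests on assumptions that are not part of the text: NumPy is used, the cosine functions sqrt(2) cos(j pi w) stand in for the eigenfunctions of $C_h(v,w)$, the function being expanded is d(w) - 1 for the hypothetical comparison density d(w) = 6w(1 - w), and all function names are invented for the example.

```python
import numpy as np


def simpson(values, l):
    """Composite Simpson's rule on the grid 0, 1/l, ..., 1 (l even)."""
    d = np.ones(l + 1)
    d[1:-1:2] = 4.0
    d[2:-1:2] = 2.0
    return float(d @ values) / (3.0 * l)


def truncation_error(f, basis, n_terms, l=400):
    """L2 norm of f minus its projection on the first n_terms basis functions."""
    w = np.arange(l + 1) / l
    fw = f(w)
    approx = np.zeros_like(fw)
    for j in range(1, n_terms + 1):
        phi = basis(j, w)
        coef = simpson(phi * fw, l)       # generalized Fourier coefficient
        approx += coef * phi
    return np.sqrt(simpson((fw - approx) ** 2, l))


def cosine_basis(j, w):
    """Stand-in orthonormal basis (an assumption, not the eigenfunctions of C_h)."""
    return np.sqrt(2.0) * np.cos(j * np.pi * w)


def density_minus_one(w):
    """d(w) - 1 for the hypothetical comparison density d(w) = 6 w (1 - w)."""
    return 6.0 * w * (1.0 - w) - 1.0


if __name__ == "__main__":
    for n_terms in (0, 2, 4, 10, 20):
        err = truncation_error(density_minus_one, cosine_basis, n_terms)
        print(f"M = {n_terms:2d}: L2 error of truncated expansion = {err:.5f}")
```

Each added term removes the part of the function lying along one more basis direction, so the error of the finite expansion is the distance to an M-dimensional subspace regardless of whether the full basis is complete.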

In Section 4, the approximate distribution of $\varphi^2_h$ under H0 is found by
numerically inverting the characteristic function of

$$\sum_{j=1}^{M} \theta_j^h Z_j^2. \tag{3.3.7}$$

One chooses M sufficiently large so that $\theta_j^h$ is very small for $j > M$. The
percentage points of (3.3.7) are then used in proxy for those of $\varphi^2_h$. This is the
methodology employed by Durbin and Knott (1972) with excellent results. The
calculated distribution does not depend on whether the tail of the sequence fills
out to be complete for $S_D$ or not.
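
One way to carry out such an inversion for a finite weighted sum of squared N(0,1) variables is Imhof's (1961) formula. The sketch below is a stand-in under stated assumptions: the text does not say which inversion scheme is used in Section 4, NumPy is assumed, the integral is truncated and evaluated by the trapezoidal rule with hypothetical default settings, and the Cramér-von Mises weights 1/(j pi)^2 in the demonstration replace the eigenvalues of $C_h(v,w)$, which are not tabulated here.

```python
import numpy as np


def weighted_chisq_sf(x, theta, upper=2000.0, n_grid=400001):
    """P(sum_j theta_j * Z_j**2 > x) for iid N(0, 1) Z_j via Imhof's formula.

    P(Q > x) = 1/2 + (1/pi) * integral_0^inf sin(beta(u)) / (u * rho(u)) du,
    beta(u) = (1/2) * sum_j arctan(theta_j * u) - x * u / 2,
    rho(u)  = prod_j (1 + theta_j**2 * u**2) ** (1/4).
    The integral is truncated at `upper` and evaluated by the trapezoidal rule.
    """
    theta = np.asarray(theta, dtype=float)
    u = np.linspace(0.0, upper, n_grid)
    beta = -0.5 * x * u
    log_rho = np.zeros_like(u)
    for t in theta:
        beta += 0.5 * np.arctan(t * u)
        log_rho += 0.25 * np.log1p((t * u) ** 2)
    integrand = np.empty_like(u)
    integrand[1:] = np.sin(beta[1:]) / (u[1:] * np.exp(log_rho[1:]))
    integrand[0] = 0.5 * (theta.sum() - x)      # limit of the integrand as u -> 0
    du = u[1] - u[0]
    integral = du * (integrand.sum() - 0.5 * (integrand[0] + integrand[-1]))
    return 0.5 + integral / np.pi


if __name__ == "__main__":
    # Sanity check against a chi-square with 2 degrees of freedom.
    print("chi-square(2), x = 2:", weighted_chisq_sf(2.0, [1.0, 1.0]),
          "exact:", np.exp(-1.0))
    # Stand-in weights (an assumption): Cramer-von Mises eigenvalues, M = 100.
    theta = 1.0 / (np.pi * np.arange(1, 101)) ** 2
    for x in (0.2, 0.35, 0.46, 0.75):
        print(f"P(Q > {x}) = {weighted_chisq_sf(x, theta):.4f}")
```

The equal-weight cases are used as checks because the sum then reduces to an ordinary chi-square variable with a closed-form tail probability.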

Shorack and Wellner (1986) state that a necessary and sufficient condition
for the eigenfunctions to span a given space and for the eigenvalues to all be
positive is that $C_h(v,w)$ be positive definite over that space. Here, it is required
that

$$\int_0^1 \int_0^1 f(v)\,C_h(v,w)\,f(w)\,dv\,dw > 0,$$

for all $f \in S_D$ with $f \not\equiv 0$. Lemma 3.3.2 states that $C_h(v,w)$ is positive
semi-definite over this space. Checking positive definiteness turns out to be no small
task. Lemma 3.3.4 gives an equivalent condition.

Lemma 3.3.4. $C_h(v,w)$ is positive definite on $S_D$ if and only if the integral
equation,

$$\frac{1}{h} \int_0^1 K^s\!\left(\frac{w-u}{h}\right) f(w)\,dw = c,$$

has no solution for $f \in S_D$ where $f \not\equiv 0$ and $c = 0, 1$.

Unfortunately,  the  condition  of  Lemma  3.3.4  is  no  easier  to  check  than  the 
initial  statement  of  positive  definiteness.  The  integral  equation  of  Lemma  3.3.4 
is  a  Fredholm  integral  equation  of  the  first  kind.  These  are  among  the  hardest 
to solve both analytically and numerically; see, for example, Marti (1982) and
Lukas (1980). Further, the integrand involves the boundary kernel and so is a
rational function of w; it is not even a 'nice' Fredholm equation of the first kind.
There  seems  no  hope  of  an  analytic  solution.  Numerical  approaches  are  doubly 
difficult  in  this  context  since  the  question  is  over  the  existence  of  a  solution, 
not  finding  a  solution  which  is  known  to  exist.  Hence,  one  would  be  put  in  the 
position  of  trying  to  gauge  whether  the  approximation  is  converging  to  a  true 
solution  or  not.  A  further  difficulty  of  numerical  approaches  is  restricting  the 
solution  to  reside  in  the  space  Sp.  Such  a  restriction  would  be  very  difficult  to 
impose. 

The question of exactly what space the eigenfunctions span will not be resolved
here. Where appropriate, the implications of the eigenfunctions forming
a complete basis for Sp or failing to do so will be noted.

3.3.5. The Subset Chi-Square Test. In this subsection, a new test is presented.
This test is referred to as the subset chi-square test. It is applied to the
components, $Z_{N1}, \ldots, Z_{NM}$, in Subsection 3.3.6. It is seen to have several
desirable properties. First, it represents a compromise between two existing tests:
the standard chi-square test and the independent tests method. The latter tests
the components one at a time at a smaller size than desired so that the overall
test has the desired size. Second, the subset chi-square test indicates not only
that some components are significant when it rejects, but also which ones are
significant. Third, it lends itself to graphical display of a criterion function much
in the way Akaike's (1974) AIC and Parzen's (1977) CAT criteria do in time
series analysis.

Let $S_1, \ldots, S_M$ be independent normally distributed random variables with
variance 1 and suppose that $S_i$ has mean $\mu_i$. A test of $H_0: \mu_1 = \mu_2 = \cdots =
\mu_M = 0$ versus $H_a: \mu_i \neq 0$ for at least one $i$ is desired. The need to conduct tests
of this nature is not unknown in statistics. For instance, in time series analysis
under the null hypothesis of no autocorrelation, the first $M$ standardized sample
autocorrelations are asymptotically iid N(0,1) [see Newton (1988), page 158].
The two most commonly used tests in this framework are the chi-square test and
the independent tests method.

There do not exist optimal tests such as the uniformly most powerful unbiased
test in this case; the alternatives are just too general. One can, of course,
form several classically motivated statistics: the likelihood ratio, Wald, and Rao's
efficient score statistics can all be derived. In this situation, one usually sees one
of two tests applied. The first is the chi-square test statistic

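\[ \sum_{i=1}^{M} S_i^2, \]

which is distributed as $\chi^2_M$ under $H_0$. The second is to test each of the $S_i$'s one
at a time and adjust the size of each test so that the overall test has the desired
size, $\alpha$. This is termed the independent tests method. In this procedure, $H_0$ is
rejected if and only if

\[ S_i^2 > \chi^2_1\left((1 - \alpha)^{1/M}\right), \]

for any $i = 1, \ldots, M$, where $\chi^2_1((1 - \alpha)^{1/M})$ is the quantile of the $\chi^2_1$ distribution
evaluated at $(1 - \alpha)^{1/M}$. The overall test does indeed have size $\alpha$: under $H_0$ the
$M$ independent component tests all accept with probability $[(1 - \alpha)^{1/M}]^M = 1 - \alpha$.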
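The two classical tests can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the author's implementation: the function names are introduced here, and the chi-square quantile is obtained by bisection on a series expansion of the distribution function, which is simply one convenient choice.

```python
import math

def chi2_cdf(x, df):
    """Chi-square CDF via the power series for the regularized lower incomplete gamma."""
    if x <= 0.0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-16 * total:          # all terms are positive, so no cancellation
        term *= z / (a + n)
        total += term
        n += 1
    return total * math.exp(a * math.log(z) - z - math.lgamma(a))

def chi2_quantile(p, df):
    """Invert chi2_cdf by bisection (hypothetical helper, adequate for moderate df)."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, df) < p:          # expand until the quantile is bracketed
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def chi_square_rejects(s, alpha):
    """Reject H0 when the sum of squares exceeds the chi-square(M) quantile at 1 - alpha."""
    return sum(v * v for v in s) > chi2_quantile(1.0 - alpha, len(s))

def independent_tests_rejects(s, alpha):
    """Reject H0 when any single square exceeds the chi-square(1) quantile at (1 - alpha)**(1/M)."""
    level = (1.0 - alpha) ** (1.0 / len(s))
    crit = chi2_quantile(level, 1)
    return any(v * v > crit for v in s)
```

For M = 4 and size 0.05, the observation (2, 2, 2, 2) is rejected by the chi-square test but not by the independent tests method, while (3, 0, 0, 0) is rejected by the independent tests method but not by the chi-square test.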

There  are  several  considerations  in  choosing  a  test.  The  first  is  that  it 
should  have  reasonable  power  against  a  wide  range  of  alternatives.  A  second 
consideration  is  that  if  the  null  hypothesis  is  rejected,  the  test  should  indicate 
why it has been rejected. That is, it should indicate which of the $S_i$'s were
judged  to  have  non-zero  means.  For  the  components,  this  information  is  useful 
on  two  grounds.  First,  knowing  which  rank  statistic  is  large  serves  as  a  numerical 
indicator  to  accompany  the  estimate  of  the  comparison  density.  Graphs  should 
always  be  accompanied  by  diagnostics  to  reinforce  the  message.  Knowing  that 
the  second  component  is  the  major  difference  between  the  samples  is  meaningful. 
Second, it will be seen that one can construct an orthogonal series estimate of
the comparison density based on the significant components. This follows from their
interpretation as generalized Fourier coefficients. Finally, it would also be desirable if the test
has  some  graphical  components  in  the  manner  of  AIC  or  CAT.  One  would  like 
something  more  than  just  a  list  of  significant  components. 

What is needed is a unified framework in which to discuss these two tests
and to derive new ones. An appropriate framework is suggested by analogy with
the optimal subset regression techniques of Furnival (1971) and Furnival and
Wilson (1974). The idea behind optimal subset regression is to find that subset
of a given size, $k$, of the regressors which yields the least RSS (residual sum of
squares). One can then vary $k$ and use some criterion function such as Mallows'
$C_p$ to help choose a subset size. The key concept here is the examination of all
subsets to yield the best result. By analogy, one could look at all subsets of size $k$
of $S_1^2, \ldots, S_M^2$ for $k = 1, \ldots, M$ and reject $H_0$ if the sum of the members of some
subset were found to be too large. The philosophy is that $S_1^2, \ldots, S_M^2$ should not
be considered only singly but should also be allowed to reinforce one another.

Mathematically, such a test works out to be: Reject $H_0$ if and only if

\[ S_{i_1}^2 + \cdots + S_{i_k}^2 > D_M(k, \alpha), \]

for some $k$, $1 \le k \le M$, and some $(i_1, \ldots, i_k)$. The indices $(i_1, \ldots, i_k)$
range over all $\binom{M}{k}$ subsets of size $k$ taken from $\{1, \ldots, M\}$. The sequence
$D_M(1, \alpha), \ldots, D_M(M, \alpha)$ are critical values which must be selected. This
sequence determines the properties of the test and must keep the overall size at
$\alpha$.

To perform this test, one needn't actually look at all $2^M - 1$ subsets. In fact,
even the branch and bound algorithm of Furnival and Wilson is unnecessary. All
one need examine are $M$ subsets. The above test is equivalent to: Reject $H_0$ if
and only if

\[ S_{(M)}^2 + \cdots + S_{(M-k+1)}^2 > D_M(k, \alpha), \]

for some $k$, $1 \le k \le M$, where $S_{(1)}^2, \ldots, S_{(M)}^2$ are the order statistics of
$S_1^2, \ldots, S_M^2$. There is no onerous computational burden at all. However, the
optimal subset analogy is more motivational than simply starting with the order
statistics.
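The equivalence between the all-subsets search and the order-statistics form is easy to check numerically. The sketch below compares an exhaustive search with the sorted shortcut; the function names are introduced here for illustration only.

```python
from itertools import combinations

def max_subset_sum_brute(s, k):
    """Largest sum of squares over all subsets of size k (exhaustive search)."""
    squares = [v * v for v in s]
    return max(sum(c) for c in combinations(squares, k))

def max_subset_sum_sorted(s, k):
    """Same quantity from the order statistics: the sum of the k largest squares."""
    squares = sorted((v * v for v in s), reverse=True)
    return sum(squares[:k])
```

Because the largest subset sum of size k is always attained by the k largest squares, only M partial sums need to be compared with the critical values, rather than $2^M - 1$ subset sums.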

At first blush, there seem to be at least three reasonable choices for the
sequence of critical values, $D_M(1, \alpha), \ldots, D_M(M, \alpha)$. These are:

1. $D_M(k, \alpha) = \chi^2_M(1 - \alpha)$;

2. $D_M(k, \alpha) = k\,\chi^2_1((1 - \alpha)^{1/M})$;

3. $D_M(k, \alpha) = \chi^2_k(t(M, \alpha))$.

These shall be referred to as sequences (1), (2), and (3), respectively. Sequence
(1) yields the ordinary chi-square test. For the chi-square test, the null hypothesis
is rejected if $\sum_{i=1}^{M} S_i^2 > D_M(M, \alpha) = \chi^2_M(1 - \alpha)$. If the chi-square rejects,
the test based on sequence (1) will, also. Suppose, for some subset of size $k$,
one has

\[ S_{i_1}^2 + \cdots + S_{i_k}^2 > D_M(k, \alpha). \]

Then, of course,

\[ \sum_{i=1}^{M} S_i^2 \ge S_{i_1}^2 + \cdots + S_{i_k}^2 > D_M(k, \alpha) = \chi^2_M(1 - \alpha), \]

so the chi-square test rejects as well. Thus, sequence (1) is equivalent to the
standard chi-square test.

Sequence (2) is equivalent to the independent tests method. Clearly, if the
independent tests method rejects then the test based on (2) will, also. Suppose,
for some $k$ and some subset $(i_1, \ldots, i_k)$ that

\[ S_{i_1}^2 + \cdots + S_{i_k}^2 > k\,\chi^2_1((1 - \alpha)^{1/M}). \]

If $S_{i_j}^2 \le \chi^2_1((1 - \alpha)^{1/M})$ for $j = 1, \ldots, k$ then the above cannot hold. Hence
$S_{i_j}^2 > \chi^2_1((1 - \alpha)^{1/M})$ for some $j$ and the independent tests method also rejects.
Thus sequence (2) is equivalent to the independent tests method.

The critical sequence (3) yields what is to be called the subset chi-square test.
The critical value is a natural one since, taken alone, each term $S_{i_1}^2 + \cdots + S_{i_k}^2$
is distributed as $\chi^2_k$ under $H_0$. To keep the overall test at size $\alpha$, one needs
to adjust the size of each term in the critical sequence. This is the purpose of
evaluating the quantile at the point $t(M, \alpha)$. All three tests are now in a
common framework so that comparisons are possible.
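The common framework can be sketched for M = 2, where both chi-square quantiles needed have closed forms. The code assumes that sequence (3) is $\chi^2_k(t(M, \alpha))$, as the criterion function defined later in this subsection indicates, and takes t(2, 0.05) = 0.97629 from Table 5. The function names are introduced here and are not part of the original work.

```python
import math
from statistics import NormalDist

def chi2_quantile_1df(p):
    """Chi-square(1) quantile: the square of a standard normal quantile."""
    return NormalDist().inv_cdf((1.0 + p) / 2.0) ** 2

def chi2_quantile_2df(p):
    """Chi-square(2) quantile in closed form (exponential distribution with mean 2)."""
    return -2.0 * math.log(1.0 - p)

def critical_sequences_m2(alpha=0.05, t=0.97629):
    """Critical values [D_2(1, alpha), D_2(2, alpha)] for sequences (1), (2) and (3).

    The default t is the Table 5 entry for M = 2 and alpha = 0.05 (976.29 / 1000);
    a different alpha needs a matching t.
    """
    level = (1.0 - alpha) ** 0.5
    seq1 = [chi2_quantile_2df(1.0 - alpha)] * 2            # ordinary chi-square test
    seq2 = [k * chi2_quantile_1df(level) for k in (1, 2)]  # independent tests method
    seq3 = [chi2_quantile_1df(t), chi2_quantile_2df(t)]    # subset chi-square test
    return seq1, seq2, seq3

def rejects(s, crit):
    """Reject H0 if the sum of the k largest squares exceeds crit[k - 1] for some k."""
    squares = sorted((v * v for v in s), reverse=True)
    total = 0.0
    for sq, c in zip(squares, crit):
        total += sq
        if total > c:
            return True
    return False
```

With these values the single-component critical value of the subset chi-square test lies between those of the other two tests, and so does its two-component critical value, which is one way to see why the test behaves as a compromise.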

Figure 18 presents the critical regions for the chi-square test and the
independent tests method for $M = 2$. The square is the critical region for the
independent tests method and the circle is the critical region for the chi-square
test. These regions are for a size of 0.05. The tests reject when the pair $(S_1, S_2)$
falls outside the appropriate figure. Comparing the two, it is possible to draw
some tentative conclusions about which test is more powerful for certain kinds
of alternatives. If the alternative is in the direction of one of the axes, the
independent tests method should outperform the chi-square test because its critical
region is shorter in that direction. Similarly, if the alternative is in the direction
of one of the diagonals, the chi-square test should do better.

Fig. 18. Critical regions for the chi-square test and the independent tests method
for $M = 2$ and $\alpha = 0.05$. The critical region for the chi-square test is the circle
and that for the independent tests method is the square.

Figure 19 presents the critical region for the subset chi-square test. The
test rejects anytime the pair $(S_1, S_2)$ falls outside either the circle or the square.
Comparing the dimensions of these shapes to those in Figure 18, one sees those
in Figure 19 are slightly larger. Visually, the subset chi-square test appears to be
a compromise between the other two. It is a compromise of the independent tests
method by cutting off the corners of the square. It is a compromise of the
chi-square test by reducing the distance in the direction of the axes. If the alternative
is in the direction of an axis, the independent tests method and the subset
chi-square should be about the same and both better than the standard chi-square
test. If the alternative is in the direction of a diagonal, the chi-square test should
perform the best followed by the subset chi-square and the independent tests
method.

Fig. 19. Critical region for the subset chi-square test for $M = 2$ and $\alpha = 0.05$.
The test rejects any time the observation $(S_1, S_2)$ falls outside either the square
or the circle.

A small power study confirms these findings. Figure 20 presents the power
curves of the three tests for $M = 4$ and alternative $(\mu_1, \mu_2, \mu_3, \mu_4) = q \cdot (1, 1, 1, 1)$.
The scalar $q$ ranges from 0 to 6.4. The power curves are constructed by simulation
techniques using 10,000 replications. For each replication, four independent
normal random variables are drawn with mean 0 and variance 1. For this
realization, a loop steps through the range of $q$ values. For each value $q_j$ of $q$ the
appropriate means are added to the normal random variables drawn above. The
sample power curve for the $i$th test at the alternative $q_j$ is calculated as 1 if the
test rejects and 0 otherwise. The estimates are then averaged over all realizations
to generate Figure 20. Clearly, there is reuse of each sample over the range of
alternatives. However, this technique will converge to the correct values and it
does impose monotonicity on the estimated power functions. From Figure 20,
one can see that the ordering of the tests is as predicted.
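The simulation scheme just described can be sketched as follows. This is an illustrative reconstruction rather than the original program: the function names, the random number generator, the seed and the chi-square quantile routine (bisection on a series for the distribution function) are all assumptions, and the subset chi-square critical values use t(4, 0.05) = 0.99008 from Table 5.

```python
import math
import random

def chi2_cdf(x, df):
    """Chi-square CDF via the power series for the regularized lower incomplete gamma."""
    if x <= 0.0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-16 * total:
        term *= z / (a + n)
        total += term
        n += 1
    return total * math.exp(a * math.log(z) - z - math.lgamma(a))

def chi2_quantile(p, df):
    """Invert chi2_cdf by bisection (hypothetical helper)."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, df) < p:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def critical_sequences(m, alpha, t):
    """Critical values D_M(k, alpha), k = 1..M, for the three tests; t is t(M, alpha)."""
    level = (1.0 - alpha) ** (1.0 / m)
    return {
        "chi-square": [chi2_quantile(1.0 - alpha, m)] * m,
        "independent": [k * chi2_quantile(level, 1) for k in range(1, m + 1)],
        "subset": [chi2_quantile(t, k) for k in range(1, m + 1)],
    }

def power_curves(q_grid, direction, crit_by_test, reps=10000, seed=0):
    """Estimate power at each q, reusing each N(0,1) draw across the whole q grid."""
    rng = random.Random(seed)
    hits = {name: [0] * len(q_grid) for name in crit_by_test}
    for _ in range(reps):
        z = [rng.gauss(0.0, 1.0) for _ in direction]
        for j, q in enumerate(q_grid):
            squares = sorted(((zi + q * di) ** 2 for zi, di in zip(z, direction)),
                             reverse=True)
            partial, total = [], 0.0
            for sq in squares:                 # sums of the k largest squares
                total += sq
                partial.append(total)
            for name, crit in crit_by_test.items():
                if any(p > c for p, c in zip(partial, crit)):
                    hits[name][j] += 1
    return {name: [h / reps for h in counts] for name, counts in hits.items()}
```

A call such as `power_curves([0.0, 1.0, 2.0, 4.0], (1, 1, 1, 1), critical_sequences(4, 0.05, 0.99008))` gives estimated power for the diagonal alternative; replacing the direction by `(1, 0, 0, 0)` gives the axis alternative.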

Figure 21 repeats the same method for the alternative $(\mu_1, \mu_2, \mu_3, \mu_4) =
q \cdot (1, 0, 0, 0)$. Again the results are as predicted; however, the curves are much
closer together than in Figure 20. The subset chi-square and the independent
tests method are virtually identical and the standard chi-square is only slightly
worse.

The subset chi-square represents a good compromise between the independent
tests method and the chi-square test. It also possesses other features to
recommend it. How a test fits into a graphical, model selection environment
must also be considered. For instance, the chi-square test does not indicate
which components caused the rejection, only that it did reject. The independent
tests method does indicate which components are large but considers them only
singly. Heuristically, it seems possible that $S_1^2$ and $S_2^2$ may be insignificant taken
singly but that $S_1^2 + S_2^2$ may be significant. From practical experience, the
independent tests method seems to be 'stingy' in declaring components significant.
This will be seen, for example, in a data set examined in Section 5. The subset
chi-square test does not suffer from these difficulties.

The subset chi-square test lends itself to graphical display much as AIC or
CAT do. Define

\[ C(k) = \max \left( S_{i_1}^2 + \cdots + S_{i_k}^2 \right) - \chi^2_k(t(M, \alpha)), \]

for $k = 1, \ldots, M$, where the indices, $(i_1, \ldots, i_k)$, range over all subsets of size $k$
of $\{1, \ldots, M\}$. One then graphs $C(k)$ versus $k$. If the values are all negative,
the null hypothesis is not rejected. Since $C(k)$ is a function, it has shape and
shapes impart information. If $C(k)$ has a very sharp and pronounced peak, then
there is a strong choice for a particular subset. If $C(k)$ is flat without much
of a peak, then there are several subsets which could be considered candidates.
These interpretations will take on greater meaning in the next subsection where
an orthogonal series estimator based on the components is introduced.
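Since the maximum over subsets of size k is the sum of the k largest squares, the criterion function can be computed in one pass over the sorted squares. In the sketch below the name `criterion` and the argument `quantiles` are introduced for illustration; `quantiles[k - 1]` is assumed to hold $\chi^2_k(t(M, \alpha))$, supplied by the caller.

```python
def criterion(s, quantiles):
    """C(k) for k = 1..M: sum of the k largest squares minus the k-th critical value."""
    squares = sorted((v * v for v in s), reverse=True)
    values, total = [], 0.0
    for sq, q in zip(squares, quantiles):
        total += sq
        values.append(total - q)
    return values

def rejects_null(s, quantiles):
    """The test rejects exactly when some C(k) is positive."""
    return any(c > 0.0 for c in criterion(s, quantiles))
```

With the made-up critical values `[5.0, 7.0, 9.0]` (chosen only to keep the arithmetic simple, not real quantiles), `criterion([3.0, 0.0, -1.0], [5.0, 7.0, 9.0])` is `[4.0, 3.0, 1.0]`: the curve peaks at k = 1, pointing to the single largest component.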

A  plot  of  C(k)  versus  k  is  only  plotting  the  winner  for  each  subset  size.  One 
could  also  plot  below  C(k)  the  next  largest  value  of  the  criterion  function.  This 
would  give  some  indication  of  the  cost  of  switching  the  smallest  component  in 
the  optimal  subset  with  the  next  smaller  component.  Since  all  these  concepts 
are easily written in terms of order statistics, computation is not a problem.

There is no ready analytic method for determining the function $t(M, \alpha)$.
Even for $M = 2$, the integral is not tractable. The function is found by simulation
methods. The procedure is to fix $u = t(M, \alpha)$, the nominal size of the
chi-squares, at various levels and then estimate the true size of the test, $\alpha$. For the
$j$th realization of $M$ iid N(0,1) random variables, the function

\[ B_j(u_i) = \begin{cases} 1 & \text{if the test accepts for } t = u_i, \\ 0 & \text{if the test rejects for } t = u_i, \end{cases} \]

is found for the grid of $u_i$'s equal to $u_i = 0.95 + 0.04999999(i - 1)/25$ for $i =
1, \ldots, 25$. These functions are then averaged over 25,000 replications to arrive at
the function $\hat{R}(M, u)$:

\[ \hat{R}(M, u_i) = \frac{1}{25{,}000} \sum_{j=1}^{25{,}000} B_j(u_i). \]

The function $\hat{R}(M, u)$ is estimating $R(M, u)$ which is the inverse function of
$t(M, \alpha)$. Multiple uses are made of the $M$ random variables for each realization
since the test is conducted at a grid of $u$ values. However, this preserves the
monotonicity of the estimated function $\hat{R}(M, u)$. The value of $t(M, \alpha)$ is found
by interpolating the function $(\hat{R}(M, u_i), u_i)$. That is, one finds $i$ and $i'$ with
$i = i' - 1$ such that

\[ \hat{R}(M, u_i) < 1 - \alpha, \qquad \hat{R}(M, u_{i'}) > 1 - \alpha, \]

and then linearly interpolates the pairs $(\hat{R}(M, u_i), u_i)$, $(\hat{R}(M, u_{i'}), u_{i'})$ at the
point $1 - \alpha$ to arrive at $t(M, \alpha)$ calculated from the $u$ domain. From this
procedure, it is clear why it is so important that the estimated function $\hat{R}(M, u)$
be monotone. If it were not, an inverse wouldn't exist and the procedure for
estimating $t(M, \alpha)$ would fail.
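The calibration procedure can be sketched for M = 2, where the chi-square quantiles with 1 and 2 degrees of freedom are available in closed form. The code is a hedged illustration, not the original program: the function name, seed and generator are assumptions, the acceptance indicator is taken to be 1 when the test accepts, and the grid follows the formula as printed above.

```python
import math
import random
from statistics import NormalDist

def estimate_t_m2(alpha=0.05, reps=25000, seed=0):
    """Estimate t(2, alpha) by simulating acceptance rates on a grid of nominal levels u."""
    grid = [0.95 + 0.04999999 * (i - 1) / 25 for i in range(1, 26)]
    nd = NormalDist()
    crit1 = [nd.inv_cdf((1.0 + u) / 2.0) ** 2 for u in grid]  # chi-square(1) quantiles
    crit2 = [-2.0 * math.log(1.0 - u) for u in grid]           # chi-square(2) quantiles
    rng = random.Random(seed)
    accepts = [0] * len(grid)
    for _ in range(reps):
        a = rng.gauss(0.0, 1.0) ** 2
        b = rng.gauss(0.0, 1.0) ** 2
        largest, total = max(a, b), a + b
        # The same draw is tested at every grid point, so acceptance is monotone in u.
        for i in range(len(grid)):
            if largest <= crit1[i] and total <= crit2[i]:
                accepts[i] += 1
    r_hat = [c / reps for c in accepts]
    target = 1.0 - alpha
    for i in range(len(grid) - 1):
        if r_hat[i] < target <= r_hat[i + 1]:
            w = (target - r_hat[i]) / (r_hat[i + 1] - r_hat[i])
            return grid[i] + w * (grid[i + 1] - grid[i]), r_hat
    raise ValueError("1 - alpha is not bracketed by the estimated R on this grid")
```

Because each realization is reused at every grid point and the critical values increase with u, the estimated acceptance curve is nondecreasing, which is what makes the linear interpolation well defined.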

Figure 22 presents the function $(u, \hat{R}(M, u))$ for $M = 2$. The function has
been linearly interpolated between the points $u_i$. The function is indeed monotone
as needed. The simulations are conducted for $M = 2, \ldots, 15$. In this
framework, presenting 14 graphs would be somewhat awkward. Instead, Table 5
presents $t(M, \alpha)$ for $\alpha = 0.05$ and 0.01 and for $M = 2, \ldots, 15$. These values can
be used in conjunction with an algorithm to evaluate the chi-square quantile to
find the critical sequence.

Fig. 22. The function $\hat{R}(M, u)$ for determining the critical sequence of the subset
chi-square test for $M = 2$.

In  this  subsection  a  new  test,  the  subset  chi-square  test,  was  introduced  and 
compared  to  two  existing  tests,  the  chi-square  and  the  independent  tests  method. 
The  subset  chi-square  was  seen  to  be  a  good  compromise  between  these  other 
two in terms of the kinds of alternatives it detects. Further, the subset
chi-square lends itself to the kind of graphical, model selection techniques which are
sought. It is possible to define a criterion function, $C(k)$, which not only indicates
acceptance  and  rejection  but  also  points  out  particular  subsets  which  are  deemed 
significant.  A  simulation  study  was  conducted  to  estimate  the  function  t(M,  a) 
which  determines  the  sequence  of  critical  values  for  the  subset  chi-square  test. 
In  the  next  subsection,  the  subset  chi-square  test  is  applied  to  the  components. 
In  this  case,  important  uses  are  made  of  the  components  found  to  be  significant. 

Table 5

Points at which to evaluate the chi-square quantile to find the critical
sequence for the subset chi-square test (values are multiplied by 1000).

          Size of Overall Test
   M        0.05        0.01
   2      976.29      995.44
   3      985.33      997.36
   4      990.08      998.30
   5      992.44      998.66
   6      994.19      998.94
   7      995.82      999.22
   8      996.51      999.37
   9      997.29      999.49
  10      997.77      999.56
  11      998.22      999.64
  12      998.51      999.70
  13      998.64      999.73
  14      998.86      999.77
  15      998.93      999.79

3.3.6. Orthogonal Series Estimates. In this subsection, the subset chi-square
test is applied to the components. This test leads naturally to an orthogonal
series estimator of the comparison density. The relation of the orthogonal series
estimator and the boundary kernel estimator is investigated. The orthogonal
series estimator is found to be a damped series with the weights being the
eigenvalues of the null covariance kernel. Hence, one can view the boundary kernel as
behaving like a damped orthogonal series estimator.

The subset chi-square test can be applied directly to the sample normalized
components, $Z_{N1}, \ldots, Z_{NM}$. This application is justified by Lemma 3.3.3
which gives the joint convergence in distribution of these random variables to
limiting random variables which are iid N(0,1) under $H_0$. The first result of the
application of the subset chi-square test to the components is invariance.

Lemma  3.3.5.  The  subset  chi-square  test  applied  to  the  components  is  invariant 
as  to  which  sample  is  called  the  first. 

The intriguing idea of the subset chi-square test is that it returns subsets of
significant components as well as an accept/reject decision. Since these components
are also interpretable as generalized Fourier coefficients, it is only natural
to define an orthogonal series estimator based on these. Suppose $C(k)$ attains
its maximum at $k = k^*$ and that $C(k^*) > 0$. Let $i_1, \ldots, i_{k^*}$ be the subset which
generates this maximum value. An orthogonal series estimate of the comparison density is

“ 1  =  YL  ZNjtf(w)- 

je(*i,... ,»'**) 

Other subsets could be examined based on the shape of $C(k)$.
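The selection of the subset that enters the series estimate can be sketched as below. The function name and arguments are hypothetical; `quantiles[k - 1]` is assumed to hold $\chi^2_k(t(M, \alpha))$, and the returned indices are 0-based positions in the list of components. Evaluating the estimate itself would also require the basis functions, which are not reproduced here.

```python
def select_components(z, quantiles):
    """Return (k_star, indices) for the subset maximizing C(k), or (0, []) if no C(k) > 0."""
    order = sorted(range(len(z)), key=lambda j: z[j] * z[j], reverse=True)
    best_k, best_c, total = 0, 0.0, 0.0
    for k, j in enumerate(order, start=1):
        total += z[j] * z[j]              # sum of the k largest squared components
        c = total - quantiles[k - 1]      # criterion value C(k)
        if c > best_c:
            best_k, best_c = k, c
    return best_k, sorted(order[:best_k])
```

With the made-up critical values `[5.0, 7.0, 9.0, 11.0]`, the components `[0.5, -3.0, 0.2, 2.5]` give C(k) = 4, 8.25, 6.5, 4.54, so the call returns `(2, [1, 3])`: the second and fourth components form the selected subset.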

A  A 

Before  examining  d^j^( w)  any  further,  the  relation  of  and 

oo 

4i0OH -i  =  Y. 

3=1 

should be looked into. As pointed out in Subsection 3.3.4, if $\{\phi_j^h\}$ are complete for
Sp then equation (3.3.3) applies; otherwise (3.3.5) holds. More than this can be
said, however. In Subsection 2.3.4 the relation between Fourier based estimates
and kernel estimates was pointed out. This relation is a two way street. The
best way to examine $\hat{d}_{h,M}$ or $\hat{d}_{h,\infty}$ is not as a decomposition of $\hat{d}_h(w) - 1$ but as
a function of the original data. Recall the representation of $Z_{Nj}$ as

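\begin{align*}
Z_{Nj} &= \int_0^1 J_j^h(u)\, dD_N(u) - \int_0^1 J_j^h(u)\, du \\
       &= \int_0^1 J_j^h(u)\, d[D_N(u) - u] \tag{3.3.8} \\
       &= \int_0^1 \left[ J_j^h(u) - \int_0^1 J_j^h(t)\, dt \right] dD_N(u),
\end{align*}

where $J_j^h$ is defined in Subsection 3.3.3. The large sample mean is used here
instead of the small sample mean to avoid a dependence on $N$.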

It is very interesting to note that the family of score functions, $\{J_j^h(u)\}$, is
not orthogonal. The components cannot be regarded as the Fourier coefficients
of an orthogonal decomposition of $D_N(u) - u$. Instead, they satisfy the condition

\[ \int_0^1 J_j^h(u) J_{j'}^h(u)\, du - \int_0^1 J_j^h(w)\, dw \cdot \int_0^1 J_{j'}^h(t)\, dt = 0, \]

for $j \neq j'$. It does follow from this condition that the sequence $\{O_j^h(u)\}$ is
orthogonal where $O_j^h(u) = J_j^h(u) - \int_0^1 J_j^h(t)\, dt$. Thus, the components $Z_{Nj}$ can
be regarded as the generalized Fourier coefficients resulting from the usual
orthogonal series density estimation formulas using the orthogonal functions $O_j^h(u)$
and applied to the normalized ranks, $R_1/N, \ldots, R_m/N$. This observation is clear
from the last equality of equations (3.3.8). Since each $O_j^h(u)$ integrates to zero,
the constant function 1 is in this basis as well. Its Fourier coefficient is always 1
and is subtracted on the left hand side of the equals sign in the defining formulas
for $\hat{d}_{h,M}$ and $\hat{d}_{h,\infty}$.
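The step from the stated condition to the orthogonality of the centered score functions can be written out explicitly. The following is a short sketch that uses only the definition of $O_j^h$ as $J_j^h$ minus its integral over $[0, 1]$; the shorthand $\bar{J}_j$ for that integral is introduced here and is not the author's notation.

```latex
\begin{align*}
\int_0^1 O_j^h(u)\, O_{j'}^h(u)\, du
  &= \int_0^1 \bigl[ J_j^h(u) - \bar{J}_j \bigr] \bigl[ J_{j'}^h(u) - \bar{J}_{j'} \bigr] du,
     \qquad \bar{J}_j = \int_0^1 J_j^h(t)\, dt, \\
  &= \int_0^1 J_j^h(u) J_{j'}^h(u)\, du
     - \bar{J}_{j'} \bar{J}_j - \bar{J}_j \bar{J}_{j'} + \bar{J}_j \bar{J}_{j'} \\
  &= \int_0^1 J_j^h(u) J_{j'}^h(u)\, du - \bar{J}_j \bar{J}_{j'}
   \;=\; 0 \qquad (j \neq j').
\end{align*}
```

The same expansion with $j = j'$ shows that each $O_j^h$ integrates to zero, so the constant function is orthogonal to every $O_j^h$.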

Notice it is claimed that the $O_j^h(u)$'s are orthogonal and not orthonormal.
Figures 15 through 17 presented the score functions, $J_j^h$, for $j = 1, 2, 3, 4$ and
$h = 0.5, 0.3, 0.1$. At the time, the shapes of these score functions and not their
magnitudes were of primary interest. Hence, each was normalized to attain a
maximum absolute value of 1. Now, however, the magnitudes are of interest.
Figures 23 through 25 present the orthogonal functions $O_j^h(u)$ for $j = 1, 2, 3, 4$
and $h = 0.5, 0.3, 0.1$. Figure 23 is an overlay plot of the 4 functions for $h = 0.5$;
Figure 24 an overlay for $h = 0.3$; and Figure 25 an overlay for $h = 0.1$. The
most striking feature is that the magnitudes of the $O_j^h$'s decrease with increasing
$j$. The second striking feature is that this rate of decrease is slower with the
smaller bandwidths. One suspects an interplay between this observation and the
slower rate of decline in the eigenvalues (recall Figure 14) for decreasing $h$. The
following steps make this relationship mathematically clear:


||0?i|2  = 

=  Jo  [/o'  K‘  (nr)  -  /„'  *‘(,H  iu 

-  /0‘  G  l0l  5*  ( V)  (Jir)  * 

- 2  JJ  *?<'>-“  •  J0'  jj  {«•  (nr) 
Fig. 23. The orthogonal functions $O_j^h(u)$ for $j = 1, 2, 3, 4$ and $h = 0.5$. The
function $O_1^{0.5}(u)$ is the solid line; $O_2^{0.5}$ is the broken line of x's; $O_3^{0.5}$ is the
broken line of +'s; $O_4^{0.5}$ is the solid line with blocks. The squares of the $L_2$ norms
of these functions are: 1.0585, 0.6574, 0.1619, 0.0275.


- ILL 

=  J!  fo  *’{w)  [/o'  (nr)  K*‘  (nr) dtt ~ l]  ^v)dwdv 

\[ = \int_0^1 \int_0^1 \phi_j^h(v)\, C_h(v,w)\, \phi_j^h(w)\, dw\, dv = \theta_j^h, \]

where $s = s(w,h)$ and $s' = s(t,h)$. Hence, the square of the $L_2$ norm of the
function $O_j^h$ is equal to the eigenvalue $\theta_j^h$.

A  A 

This  proves  conclusively  that  and  d^>00  are  damped  orthogonal  series 
estimators  where  the  weights  are  the  eigenvalues.  By  this  is  meant  that  the 
estimator  has  the  following  representation: 

ojw 


4,<»M  =  1  +  5Z 


00  17* 

h  /  6 m 
Fig. 24. The orthogonal functions $O_j^h(u)$ for $j = 1, 2, 3, 4$ and $h = 0.3$. The
function $O_1^{0.3}(u)$ is the solid line; $O_2^{0.3}$ is the broken line of x's; $O_3^{0.3}$ is the
broken line of +'s; $O_4^{0.3}$ is the solid line with blocks. The squares of the $L_2$ norms
of these functions are: 1.0921, 0.9008, 0.4941, 0.2547.


Fig. 25. The orthogonal functions $O_j^h(u)$ for $j = 1, 2, 3, 4$ and $h = 0.1$. The
function $O_1^{0.1}(u)$ is the solid line; $O_2^{0.1}$ is the broken line of x's; $O_3^{0.1}$ is the
broken line of +'s; $O_4^{0.1}$ is the solid line with blocks. The squares of the $L_2$ norms
of these functions are: 1.1034, 1.0654, 0.9317, 0.8654.
since the sequence $\{O_j^h(u)/\sqrt{\theta_j^h}\}$ is now orthonormal. If the sequence $\{\phi_j^h\}$ is
complete for Sp, one can regard the boundary kernel estimator $\hat{d}_h$ as being
equivalent to a damped orthogonal series estimator. If the sequence is not complete
for Sp, this interpretation is limited to the projection of $\hat{d}_h(w) - 1$ onto the
subspace spanned by the eigenfunctions.

The fact that the series is weighted makes the usual problem of the choice
of truncation point much less difficult. The representation for $\hat{d}_{h,\infty}$ provides an
alternate and intriguing explanation of the role of the bandwidth in determining
the smoothness of the estimate. Larger bandwidths lead to smoother estimates
than smaller bandwidths (recall the discussion of Subsection 2.4.3). From
Figures 23 through 25 it is apparent that the higher order orthogonal functions are
rougher (higher frequency) than the lower order ones. A smaller bandwidth gives
more weight to the higher order $O_j^h$'s than a larger bandwidth, hence the smaller
bandwidth is capable of producing rougher estimates.

One can also appreciate why the estimate $\hat{d}_h$ is consistent as $h \to 0$. As the
bandwidth shrinks, more and more of the basis is allowed to enter and contribute
materially to the shape of the final estimate. In the limit, any shape can be
duplicated. This also explains the increase in variance as $h$ decreases. With
decreasing bandwidths, the number of parameter estimates (components) that
make up the estimate is increasing as more are given significant weight. This
results in an increase in variance.

At this point, it may be wise to summarize the properties of $\hat{d}_h$ and $\hat{d}_{h,M}$.
In fixed samples each estimator is biased; this is to be expected for any density
estimator [see Rosenblatt (1956) or Seheult and Quesenberry (1971)]. For fixed $h$
the estimators are also asymptotically biased. But this is not the whole story. As
a practical matter, with increasing amounts of data one would be led naturally
to choosing smaller $h$ and larger $M$. These actions reduce the amount of bias
present. Indeed, such a process will even attain consistency as per Theorem
3.2.2. The distinction is between what one assumes to prove theorems and what
one does in implementing a procedure. One uses fixed-$h$ results such as Theorem
3.2.3 and Lemma 3.3.3 to find approximate distributions. One merely conceives
of the bandwidth being fixed as the sample size increases. Although $\lambda_N$ is not
selectable, there is still a very strong analogy between the treatment of $\lambda_N$ and
$h$. It is assumed that $\lambda_N$ converges to $\lambda_0$, $0 < \lambda_0 < 1$, as $m \wedge n \to \infty$. The
limit stochastic processes all contain the term $\lambda_0$. Yet given just one sample, one
can only conceptualize a convergence. For the single sample, convergence has no
meaning.

It is not guaranteed that either $\hat{d}_h$ or $\hat{d}_{h,M}$ will be non-negative and integrate
to 1. The estimates may themselves not be densities. If the estimate is not
non-negative, then having it integrate to 1 is of little benefit. The decision to employ
such estimators was made in Section 2. Suffice it to say here that one must recall
what the estimate of the comparison density is used for. This relates to the
discussion of bias as well. The important aspect of $\hat{d}_h$ and $\hat{d}_{h,M}$ is their shape. It is
not intended to use them as density estimates. One will not simulate random variables from
them. Their important interpretation is that of likelihood ratio. These other
properties would be nice, but are not at all essential.

More  importantly,  the  orthogonal  series  estimator  can  be  shown  to  satisfy 
invariance  even  in  finite  samples.  This  result  is  stated  as  Lemma  3.3.6. 

Lemma 3.3.6. The orthogonal series estimate obeys the invariance condition

\[ \lambda_N\, \hat{d}^{F}_{h,M}(w) + (1 - \lambda_N)\, \hat{d}^{G}_{h,M}(w) = 1 \]

in finite samples.

The function $\hat{d}^{F}_{h,M}(w)$ is the estimate when the population with distribution
function $F$ is called the first sample. The function $\hat{d}^{G}_{h,M}(w)$ is the estimate when
the population with distribution function $G$ is called the first sample. This result
is due to the small sample mean correction to the components which caused them
to be invariant.

It is now known also that $\hat{d}_{h,M}$ is a damped orthogonal series estimate and
that the weights are the eigenvalues. Only a subset of the components (or
frequencies) making up $\hat{d}_h(w)$ are present in $\hat{d}_{h,M}$. The estimate $\hat{d}_{h,M}$ can be
smoother than $\hat{d}_h$ but not rougher, in the sense that higher frequencies may be absent
from $\hat{d}_{h,M}$. This observation will lead to a suggestion for choosing the bandwidth
in the next subsection.

In this subsection, the subset chi-square test was applied to the normalized
components, $Z_{N1}, \ldots, Z_{NM}$. It was seen that this test leads naturally to an
orthogonal series estimator, $\hat{d}_{h,M}$. A representation for this estimator in terms
of the original data was found. It was seen to be a damped orthogonal series
estimator with the weights being equal to the eigenvalues of the null covariance
kernel. The properties of the estimators $\hat{d}_h$ and $\hat{d}_{h,M}$ were discussed. They
were seen to be biased, even asymptotically so under a fixed bandwidth regime.
However, it was argued that such a statement is vacuous in the sense that the
bandwidth will naturally decline with increasing sample size and that fixed
bandwidth theorems are useful for approximating distributions. One is not meant to
seriously consider keeping the same bandwidth out to infinite sample sizes. The
goal of any density estimation technique should be to select the bandwidth to fit
the data parsimoniously whatever the resulting bandwidth might be.

3.3.7.  Choosing  h  and  M.  In  this  subsection,  several  schemes  for  choosing 
h  and  M  are  examined.  One  can  choose  h  either  graphically  or  by  an  automatic 
criterion.  One  can  choose  M  to  cover  only  the  desired  alternatives  or  to  include 
all  eigenvalues  above  some  cutoff.  The  issues  involved  in  choosing  h  from  the 
data  on  the  properties  of  the  test  are  also  discussed. 

The usual density estimator has either h or M; here both are present. This
adds  flexibility  to  the  problem  and  is  not  a  hindrance.  There  are  several  philoso¬ 
phies  that  might  be  adopted.  The  first  is  based  on  a  remark  from  Subsection 
3.3.6  that  the  orthogonal  series  estimate  can  be  smoother  than  the  boundary 
kernel estimate but not rougher. This approach would suggest choosing h to
undersmooth the data (i.e. $\hat d_h$ is slightly too rough) and then choose M to include
all the components whose eigenvalues exceed some cutoff such as 0.01 or 0.001.
One then relies on the subset chi-square test to include or exclude the
components as appropriate. Thus, the orthogonal series estimate of $d_h$ has available
to  it  all  models  from  too  smooth  to  too  rough.  Table  6  gives  the  numbers  of 
eigenvalues  above  three  cutoffs  as  a  function  of  bandwidth. 




Table 6

Number of eigenvalues of the null covariance
kernel above a cutoff.

                 Cutoff
   h      0.01    0.001    0.0001
  0.5       4       5         6
  0.4       5       6         8
  0.3       6       8        10
  0.2       9      12        16
  0.1      16      23        32
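
The cutoff rule for M described above can be sketched in a few lines. This is a minimal Python illustration rather than the procedure's actual implementation: the function names are hypothetical, the eigenvalues are assumed to be supplied by the caller, and the dictionary simply transcribes Table 6.

```python
from typing import Dict, Sequence

# Table 6 transcribed: bandwidth h -> {cutoff -> number of eigenvalues above it}.
TABLE_6: Dict[float, Dict[float, int]] = {
    0.5: {0.01: 4, 0.001: 5, 0.0001: 6},
    0.4: {0.01: 5, 0.001: 6, 0.0001: 8},
    0.3: {0.01: 6, 0.001: 8, 0.0001: 10},
    0.2: {0.01: 9, 0.001: 12, 0.0001: 16},
    0.1: {0.01: 16, 0.001: 23, 0.0001: 32},
}


def truncation_point(eigenvalues: Sequence[float], cutoff: float) -> int:
    """Return M, the number of eigenvalues strictly above the cutoff."""
    return sum(1 for lam in eigenvalues if lam > cutoff)


def table6_truncation_point(h: float, cutoff: float) -> int:
    """Look up M for one of the tabulated (h, cutoff) pairs."""
    return TABLE_6[h][cutoff]
```

For example, undersmoothing with h = 0.2 and a cutoff of 0.001 makes M = 12 components available to the subset chi-square test.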

Another  alternative  is  to  use  a  criterion  function  such  as  least  squares  cross- 
validation  (LSCV)  to  choose  the  bandwidth  and  then  to  include  all  the  compo¬ 
nents  above  some  cutoff.  The  properties  of  LSCV  have  not  been  established  in 
this  setting.  Anyone  proceeding  upon  such  a  path  should  use  caution.  These 
first  two  suggestions  are  similar  in  spirit. 

A  completely  different  approach  is  to  fix  M.  One  might  fix  M  based  on  the 
types of alternatives one is considering, for example, M = 2 for location and
scale.  Having  fixed  M  one  is  free  to  choose  h.  One  could  choose  h  based  on  fit 
or  based  on  the  types  of  distributions  one  wishes  to  best  protect  against,  that  is, 
one  could  choose  h  so  that  the  shapes  of  the  score  functions  are  pleasing.  Overall, 
there  seem  to  be  very  good  opportunities  to  direct  the  procedure  toward  more 
specific  alternatives  if  this  type  of  information  is  available. 

The  first  two  procedures  and  possibly  the  third  involve  the  selection  of  the 
bandwidth  based  on  the  data.  The  bandwidth  in  these  cases  is  not  only  random 
but  also  a  function  of  the  data.  It  is  of  interest  how  the  properties  of  the  test 
might  be  affected.  This  sort  of  problem  is  not  at  all  unheard  of  in  statistics. 
An analogue in regression would be the distribution of the parameter t-statistics
under  a  regression  selection  criterion  like  stepwise  regression.  If  the  bandwidth 
were  random  but  not  a  function  of  the  data,  then  the  answer  would  be  trivial: 
the  size  of  the  test  would  be  unaffected.  Of  course,  the  bandwidth  will  always 
depend  on  the  data  and  the  situation  is  more  complicated.  The  effect  of  a 
data-driven  bandwidth  depends  on  exactly  how  the  bandwidth  depends  on  the 
data. For graphical selection procedures, this question is not resolvable because
it cannot be quantified. For criterion functions such as LSCV there is hope for
an answer, though it most probably would result from simulation rather than
analytic techniques. An adjustment for a data-driven bandwidth would amount
to adjusting $\alpha$ in the critical sequence defining the subset chi-square test.

These  simulations  are  not  performed  here.  Instead,  they  are  left  as  future 
work.  There  are  cases  where  completely  automatic  methods  may  be  appropriate; 
say,  when  the  output  density  is  required  as  input  to  another  procedure.  Other¬ 
wise,  those  using  an  automatic  method  who  do  not  check  the  fit  of  the  estimated 
density  may  be  rudely  surprised  as  such  methods  do  fail:  LSCV  is  known,  for 
example,  to  drastically  undersmooth  about  5-20%  of  the  time  [see  Hart  (1988)]. 
These  methods  are  more  properly  used  to  suggest  choices  of  bandwidths.  It  is  up 
to  the  user  of  these  techniques  to  make  the  final  choice  based  on  other  criteria 
such  as  fit. 

One  of  the  strong  points  of  nonparametric  density  estimation  in  general  is 
its  ability  to  suggest  different  models.  The  methodology  here  is  no  different. 
Aside  from  altering  bandwidths,  the  function  C(k)  is  capable  of  suggesting  quite 
different  shapes  for  a  given  bandwidth.  If  the  function  is  nearly  level,  then  several 
quite different models may result. However, a caution is in order against one
particular abuse. Sometimes it will occur that the subset chi-square will fail to
reject. Upon examining the components, one may see that the test would reject
if M were decreased or h changed. It is statistically dishonest to make such a
modification  and  declare  significance.  Such  a  procedure  can  drastically  alter  the 
properties  of  the  test.  Since  some  choice  of  M  and  h  must  be  made,  this  choice 
should  be  made  before  the  subset  chi-square  test  is  run.  These  procedures  will 
still  affect  the  properties  of  the  test,  but  they  will  do  so  in  a  much  less  egregious 
manner. 

In  this  subsection,  several  different  methods  of  choosing  the  bandwidth  and 
the  truncation  point  were  examined.  Which  to  use  is  determined  by  the  objectives 
of  the  researcher.  The  issues  involved  in  the  effect  of  a  data-driven  bandwidth  on 
the  size  of  the  subset  chi-square  test  were  discussed.  In  order  to  adjust  the  size 


of  the  test,  one  would  have  to  pick  a  specific,  quantifiable  choice  criterion.  The 
necessary  adjustment  would  most  probably  have  to  be  determined  by  simulation 
techniques. 

3.3.8.  Summary  of  the  Unified  Testing  and  Estimation  Procedure.  The  pieces 
of  the  unified  testing  and  estimation  procedure  have  been  scattered  throughout 
this  section.  They  are  brought  together  in  this  subsection.  The  procedure  is  seen 
to  fulfill  the  features  outlined  in  Subsection  2.2.6  and  Section  1.  This  procedure 
is  summarized  by  the  following  list: 

1.  Univariate  Analysis 

2.  Preliminary  Two  Sample  Analysis 

3.  Choosing  M  and  h 

4.  Executing  the  Subset  Chi-Square  Test  on  the  Components 

5.  Plotting  the  Orthogonal  Series  Estimate  of  the  Comparison  Density 

Any  two  sample  analysis  should  start  with  three  univariate  analyses:  the  two 
individual  samples  and  the  pooled  sample.  Statistics  such  as  the  mean,  median, 
standard  deviation,  twice  the  interquartile  range  and  trimmed  means  should  be 
examined.  Identification  quantile  plots  [Parzen  (1979)]  should  be  constructed. 
The  philosophy  is  that  before  asking  if  F  and  G  are  equal,  it  is  best  to  investigate 
the  properties  of  each  on  their  own.  Examining  the  pooled  sample  can  highlight 
distinctions  between  the  two. 

During  the  Preliminary  Two  Sample  Analysis  stage,  visual  indications  of  the 
fit  of  the  two  samples  and  standard  two  sample  statistics  are  given.  An  overlay 
plot of the two identification quantile plots is given as is a QQ plot. These two
plots remove the effect of location and scale: they compare the shapes of the
distributions. Traditional statistics such as the Cramér-von Mises or Anderson-
Darling are also given at this stage.

At  this  point,  M  and  h  need  to  be  chosen  by  one  of  the  methods  outlined 
in  Subsection  3.3.7.  In  the  example  in  Section  5,  they  will  be  chosen  by  the  first 
method  outlined.  The  components  are  then  calculated  and  the  subset  chi-square 
test applied to them. The criterion function, C(k), as defined in Subsection 3.3.5
should be displayed. The graph should include a horizontal reference line at 0.
If  C(k)  does  not  exceed  zero,  the  null  hypothesis  cannot  be  rejected  and  the  esti¬ 
mate  of  the  comparison  density  is  the  uniform  density.  If  C(k)  exceeds  0  at  one  or 
more  k  values  then  estimates  of  the  comparison  density  based  on  the  eigenfunc¬ 
tions  should  be  displayed.  The  estimate  based  on  the  subset  which  maximizes 
C(k)  is  always  displayed.  Others  may  be  displayed  at  the  user’s  discretion  based 
on  the  shape  of  C(k).  Along  with  these  graphs  a  list  of  which  components  are 
significant  and  the  components  themselves  should  be  given.  Each  graph  should 
include a horizontal reference line at 1. The boundary kernel estimate can also
be  overlaid  for  reference. 

Note  that  this  procedure  does  indeed  fulfill  the  criteria  outlined  in  Subsection 
2.2.6  and  Section  1.  It  is  certainly  a  graphically  oriented  technique.  The  subset 
chi-square  test  applied  to  the  components  is  also  a  selection  procedure  for  a  model 
of $d_h(u)$. If the null hypothesis cannot be rejected the model is uniformity;
if  the  null  hypothesis  is  rejected  the  model  is  the  orthogonal  series  estimate 
corresponding  to  those  components  found  significant.  The  test  is  omnibus.  In 
fact,  the  breadth  of  the  class  protected  against  is  under  the  control  of  the  user. 
The distribution of the components is nonparametric distribution free under H0
since  they  are  linear  rank  statistics.  There  are  as  few  restrictions  placed  on  F 
and  G  as  possible  while  maintaining  weak  convergence  results  for  the  comparison 
distribution  empirical  process.  Finally,  the  estimation  of  the  relation  of  F  to  G 
is  given  by  the  estimate  of  the  comparison  density.  All  the  requirements  are 
fulfilled  by  this  methodology. 


In  summary,  this  section  has  detailed  the  theoretical  and  computational  as¬ 
pects  of  the  boundary  kernel  estimate  of  the  comparison  density  and  tests  of 
its uniformity. The section started by giving pointwise results for the boundary
kernel estimator under a bandwidth shrinking to zero: the pointwise asymptotic
normality of the boundary kernel estimator under H0, and its pointwise consistency
and invariance under general alternatives, were shown. A stochastic process called
the  kernel  density  process  was  defined  from  the  boundary  kernel  estimator.  Con¬ 
ditions  were  given  for  its  weak  convergence  to  a  limiting  process  under  a  fixed 
bandwidth. A rationale for fixing the bandwidth was given.

Tests  of  the  null  hypothesis  were  based  on  the  kernel  density  process.  It  was 
argued that the best strategy was to base any test on a fixed number M of the
components of the kernel density process. The components were defined as the
inner product of the eigenfunctions of the covariance kernel of the kernel density
process under H0 and the boundary kernel estimate less 1, $\hat d_h(u) - 1$. Properly
scaled,  these  components  converge  jointly  to  iid  N(0,1)  random  variables  under 
H0.  The  components  are  interpretable  both  as  generalized  Fourier  coefficients 
and  as  rank  statistics. 

A  new  test,  the  subset  chi-square  test,  was  introduced  and  compared  to  ex¬ 
isting  tests.  This  test  was  then  applied  to  the  components.  The  subset  chi-square 
test  was  seen  to  have  several  desirable  properties.  First,  it  considers  the  compo¬ 
nents  in  combination  not  just  singly.  Second,  it  indicates  which  components  are 
deemed  large.  Third,  it  lends  itself  to  graphical  display.  The  subset  chi-square 
test  was  also  seen  to  suggest  an  orthogonal  series  estimate  of  the  comparison 
density  based  on  the  components  and  the  eigenfunctions.  The  relation  between 
the  orthogonal  series  estimate  and  the  boundary  kernel  estimate  was  explored. 

Methods  of  choosing  the  bandwidth  and  truncation  point  were  examined. 
The  implications  of  data  based  choices  of  the  bandwidth  were  also  discussed. 
Finally,  the  methodology  was  summarized.  It  was  seen  to  truly  meet  the  criteria 
outlined  in  Section  1  and  Subsection  2.2.6. 

4. POWER AND SIZE STUDIES


4.1.  Introduction 

In  this  section,  power  and  size  studies  are  conducted.  Subsection  4.2  covers 
power  studies;  subsection  4.3  covers  size  studies.  Subsection  4.2  derives  and 
explains  the  theoretical  concepts  necessary  for  defining  power  functions  and  the 
asymptotic  relative  efficiency  between  two  rank  statistics.  Also  detailed  are  the 
simulation  and  numeric  techniques  used  to  actually  find  the  power  functions. 

The  asymptotic  relative  efficiencies  of  the  first  two  components  are  compared 
to  standard  rank  tests.  The  bandwidth  is  seen  to  have  an  effect  on  this  efficiency. 
The  first  component  is  generally  less  efficient  than  the  standard  rank  statistics 
and  the  second  more  efficient.  It  is  found  that  the  optimal  choice  of  bandwidth 
is  not  necessarily  the  same  for  both  location  and  scale  alternatives  of  the  same 
underlying  distribution.  A  good  compromise  choice  of  bandwidth,  however,  can 
be  made  for  the  distributions  considered. 

The  subset  chi-square  test  applied  to  the  components  of  the  kernel  density 
process is found to have very good power properties. The Cramér-von Mises
and Anderson-Darling statistics have good power against the location
alternatives examined. However, when the alternative starts to principally affect higher
components, these two statistics have much poorer power properties. The subset
chi-square test is equally good against any alternative which affects components
it considers. It outperforms the Cramér-von Mises and Anderson-Darling
statistics by wide margins for alternatives influenced mainly by the fourth and higher
components.

It  is  seen  that  the  key  to  the  power  of  the  subset  chi-square  is  the  choice 
of truncation point (M) and bandwidth (h). Choosing the truncation point
too large (h too small) reduces power because the signal is swamped by noise.
Choosing the truncation point too small (h too large) reduces power because
the  signal  is  missed  by  the  test.  Careful  selection  of  the  truncation  point  and 
bandwidth  should  chart  a  course  between  these  twin  abysses. 

Subsection 4.3 covers the small sample size of the subset chi-square test
applied  to  the  components.  The  size  remains  very  close  to  the  appropriate  value 
even  for  samples  as  small  as  n  =  m  =  5.  The  size  deviates  significantly  when  one 
of  two  situations  occurs.  The  first  occurs  when  the  bandwidth  is  clearly  chosen 
to  be  too  small.  For  instance,  the  size  deviates  substantially  from  its  nominal 
value if h = 0.3 or h = 0.2 is used for n = m = 5. One would never choose such a
small bandwidth for this sample size in practice. The second occurs when m is
very small (say m = 5) and n ≫ m (say n = 100). Again, one doesn't expect to
see  such  cases  very  often  in  practice. 

4.2.  Power  Studies 

4.2.1.  Introduction.  Subsection  4.2  investigates  the  power  of  the  procedures 
discussed in Section 3 along with the Cramér-von Mises and Anderson-
Darling statistics. The power functions in this subsection are asymptotic.
Subsection 4.2.2 discusses the notion of a local alternative. The main result
of this subsection is a theorem stating the conditions under which
$CD_{0N}(u) = \sqrt{N}\,[\tilde D_N(u) - u]$ still converges weakly under local alternatives.
Local location, scale and Fourier alternatives are defined. The asymptotic relative
efficiency of two rank statistics is defined.

Subsection  4.2.3  gives  the  methods  to  be  used  to  find  power  functions.  A 
method  based  on  simulation  is  described  for  the  subset  chi-square  test.  A  method 
to  numerically  invert  the  characteristic  function  is  also  described.  A  theorem 
about  its  numerical  consistency  is  proved.  This  method  is  used  to  find  the  power 
functions for the $\varphi$, Cramér-von Mises, and Anderson-Darling statistics.

Subsection 4.2.4 demonstrates the calculations necessary to check the
conditions for weak convergence of $CD_{0N}$. These conditions are checked for Cauchy
location  and  scale  alternatives.  Subsection  4.2.5  presents  power  curves  for  two 
distributions  for  both  location  and  scale  alternatives.  Curves  for  two  Fourier 
alternatives  are  also  given.  The  subset  chi-square  test  applied  to  the  components 
is  seen  to  perform  credibly,  particularly  for  alternatives  stressing  the  second 
and  higher  components.  Asymptotic  relative  efficiencies  comparing  the  first  two 


components  to  standard  rank  tests  for  four  underlying  distributions  are  given. 
The  bandwidth  is  seen  to  make  a  difference.  No  statistic  dominates  over  all  the 
different  distributions. 

4.2.2. Weak Convergence of $CD_{0N}$ Under Local Alternatives. The weak
convergence of the stochastic process $CD_{0N}$ to a limiting process $L + (1 - \lambda_0)^{-1/2}\Delta$
under local alternatives is proved in this subsection. This is very important since
under fixed alternatives one expects the process to become unbounded as the
sample sizes increase. From the results of Section 2, one can only claim that $CD_{0N}$
converges weakly under H0, because one has the identity $CD_{0N}(u) = CD_N(u)$.
It  is  the  goal  of  this  subsection  to  broaden  the  class  for  which  convergence  is 
claimed  to  include  local  alternatives. 

The  discussion  opens  by  examining  the  concept  of  a  local  alternative.  The 
statement  of  the  theorem  on  weak  convergence  follows.  Types  of  local  alternatives 
satisfying  the  conditions  are  discussed.  Local  location  and  scale  alternatives 
are  defined.  Also  introduced  is  the  strategy  of  defining  the  local  alternative  by 
parametrically  defining  its  limiting  bias  function  and  not  specifying  the  sequence 
of  underlying  distributions. 

A local alternative is an alternative in which the distribution of $Y_1, \ldots, Y_n$
depends on n, so that $Y_{n1}, \ldots, Y_{nn}$ are iid according to the distribution
function $G_{(n)}$. The problem is that for fixed $G \ne F$, a test is either consistent or
inconsistent. If it is consistent, the asymptotic power function is 1; if it is
inconsistent, the power is equal to the size of the test. Hence, asymptotic power
curves drawn for fixed alternatives are very uninteresting. Instead, one chooses
a sequence $\{G_{(n)}\}$ of alternatives such that $G_{(n)} \to F$ as $n \to \infty$. If $G_{(n)}$
converges to F at the correct rate, several good things happen. First, the power
functions are not degenerate, that is, they are not uniformly equal to $\alpha$ or 1.
Second,  statistics  have  the  same  distribution  as  under  H0  with  the  exception  of 
the  addition  of  a  non-zero  mean.  Typically,  this  makes  power  curves  much  easier 
to  construct. 
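
The contrast between fixed and local alternatives can be seen numerically. The sketch below is not the two sample procedure of this work; it swaps in the simplest textbook case, a two-sided z-test for the mean of n unit-variance normal observations, purely to illustrate the degeneracy. All function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def z_test_power(shift: float, n: int, alpha: float = 0.05) -> float:
    """Power of a two-sided z-test when the true mean is `shift` (variance 1)."""
    z = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    drift = sqrt(n) * shift  # mean of the standardized test statistic
    return 1 - (_STD_NORMAL.cdf(z - drift) - _STD_NORMAL.cdf(-z - drift))


def fixed_alternative_power(mu: float, n: int, alpha: float = 0.05) -> float:
    """Fixed alternative: the shift mu does not change with n."""
    return z_test_power(mu, n, alpha)


def local_alternative_power(gamma: float, n: int, alpha: float = 0.05) -> float:
    """Local alternative: the shift gamma / sqrt(n) shrinks as n grows."""
    return z_test_power(gamma / sqrt(n), n, alpha)
```

With a fixed shift the power climbs to 1 as n grows, so every consistent test looks the same in the limit. With the shift shrinking like $\gamma/\sqrt{n}$ the power settles at a value strictly between $\alpha$ and 1 that depends on $\gamma$, which is what makes asymptotic power curves informative.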

For the purposes of local alternatives, let

$$Y_{n1}, \ldots, Y_{nn} \ \text{be iid} \ G_{(n)},$$

$$H_{(N)}(x) = \lambda_N F(x) + (1 - \lambda_N)\, G_{(n)}(x),$$

and

$$D_{(N)}(u) = F Q_{H_{(N)}}(u).$$

It is assumed that $G_{(n)}(x) \to F(x)$ at a rate so that the limit function, $\Delta(u)$,
defined by

(4.2.1)  $\Delta(u) = \lim_{m \wedge n \to \infty} \sqrt{n}\,\bigl(D_{(N)}(u) - u\bigr)$

exists and is continuous with $\Delta(0) = \Delta(1) = 0$. Although not specified in the
notation, $\Delta(u)$ will depend on $\lambda_0$ as well as a parameter $\gamma$ which indexes the local
alternative. The function $\Delta(u)$ is the bias function and Theorem 4.2.1 gives the
conditions under which $CD_{0N} \Rightarrow L + (1 - \lambda_0)^{-1/2}\Delta$.

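Theorem 4.2.1. Assume that there exists $\Delta(u)$ such that
$\|\sqrt{n}\,[D_{(N)}(u) - u] - \Delta(u)\| \to 0$ as $m \wedge n \to \infty$ and suppose there exist
sequences of constants $\{a_{(N)}\}$ and $\{b_{(N)}\}$ such that

$$P\bigl[a_{(N)} < \tilde Q_{H}(t) < b_{(N)},\ 0 < t < 1\bigr] \to 1$$

as $m \wedge n \to \infty$. Let $e_{(N)}$ and $f_{(N)}$ be defined as

$$e_{(N)} = Q_{H_{(N)}}(1/N) \wedge a_{(N)},$$

$$f_{(N)} = Q_{H_{(N)}}(1 - 1/N) \vee b_{(N)},$$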


and  suppose  that 


/(*) 

sup  - r-r 


inf 

e(N)<X<f(N) 


9(n)ix) 
as $m \wedge n \to \infty$. Then

$$CD_{0N} \Rightarrow L + (1 - \lambda_0)^{-1/2}\Delta$$

in  (C[0, l], Cp,p)  as  m  A  n  — *  oo  where  L(u)  =  and  B(u)  is  a 

Brownian  bridge  process. 

The  norm  ||  •  ||  is  the  sup-norm.  This  theorem  is  the  basic  building  block  for 
deriving  power  functions.  It  follows  from  a  proof  entirely  analogous  to  that  of 
Theorem  3.2.3  that  under  the  same  conditions  as  Theorem  3.2.3 

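$$KDP_{0N,h} \Rightarrow KDP_{0h} + \delta_h,$$

where

$$\delta_h(t) = (1 - \lambda_0)^{-1/2} \int_0^1 \frac{1}{h} K\!\left(\frac{t - u}{h}\right) \Delta(u)\, du,$$

and $KDP_{0h}$ is the process $KDP_h$ when H0 is true.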

To this point, nothing has been said about the character of $G_{(n)}$ or the rate
at which it must converge to F for a non-degenerate limit function to exist. These
details are now given. For a local location alternative, define $G_{(n)}$ by

$$G_{(n)}(x) = F(x - \gamma/\sqrt{n}).$$

Prihoda (1981) shows that the limit function is

(4.2.2)  $\Delta(u) = (1 - \lambda_0)\,\gamma\, fQ_F(u),$

although  a  proof  of  this  is  embedded  in  the  proof  of  Lemma  4.2.1  (below)  as  well. 
Pointwise  convergence  is  easily  shown,  however  Theorem  4.2.1  calls  for  uniform 
convergence  since  the  sup-norm  is  used.  Lemma  4.2.1  gives  the  conditions  on  F 
under  which  pointwise  convergence  implies  uniform  convergence  for  local  location 
alternatives. 
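
The location limit function can be checked numerically for a specific F. The sketch below assumes F is standard normal and assumes the pooled distribution is $H_{(N)} = \lambda F + (1 - \lambda) G_{(n)}$ with $G_{(n)}(x) = F(x - \gamma/\sqrt{n})$; the function names and the bisection inverse are illustrative choices, not part of the original development.

```python
from math import sqrt
from statistics import NormalDist

F = NormalDist()  # assumed underlying distribution: standard normal


def pooled_cdf(x: float, gamma: float, n: int, lam: float) -> float:
    """H_(N)(x) = lam * F(x) + (1 - lam) * F(x - gamma / sqrt(n))."""
    return lam * F.cdf(x) + (1 - lam) * F.cdf(x - gamma / sqrt(n))


def pooled_quantile(u: float, gamma: float, n: int, lam: float) -> float:
    """Invert the pooled cdf by bisection on a wide bracket."""
    lo, hi = -12.0, 12.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if pooled_cdf(mid, gamma, n, lam) < u:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def scaled_bias(u: float, gamma: float, n: int, lam: float) -> float:
    """sqrt(n) * (D_(N)(u) - u), where D_(N)(u) = F(H_(N)^{-1}(u))."""
    return sqrt(n) * (F.cdf(pooled_quantile(u, gamma, n, lam)) - u)


def limit_bias(u: float, gamma: float, lam: float) -> float:
    """Limit function (1 - lam) * gamma * f(Q_F(u)) for a location shift."""
    return (1 - lam) * gamma * F.pdf(F.inv_cdf(u))
```

For large n the two functions agree closely over the whole unit interval, which is the uniform convergence that Theorem 4.2.1 requires.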

For a local scale alternative, $G_{(n)}$ is defined by

$$G_{(n)}(x) = F\!\left(\frac{x}{1 + \gamma/\sqrt{n}}\right).$$

Prihoda (1981) shows the limit function in this case to be

(4.2.3)  $\Delta(u) = (1 - \lambda_0)\,\gamma\, Q_F(u)\, fQ_F(u).$

Under  certain  conditions  on  F,  the  convergence  for  location  and  scale  alternatives 
can  be  shown  to  be  uniform.  Lemma  4.2.1  gives  the  result. 

Lemma 4.2.1. Let $\Delta(u)$ be given by (4.2.2) for a local location alternative and by
(4.2.3) for a local scale alternative and suppose $\Delta(u)$ is continuous and $\Delta(0) =
\Delta(1) = 0$. Suppose that $f'$ exists and is bounded. Assume also that $\lambda_N = \lambda_0$.
Then

$$\|\sqrt{n}\,(D_{(N)}(u) - u) - \Delta(u)\| \to 0$$

as $m \wedge n \to \infty$.

These  conditions  are  easily  satisfied  by  the  distributions  to  be  considered.  The 
remaining  conditions  of  Theorem  4.2.1  must  be  shown  on  a  case  by  case  basis. 
These  conditions  are  easier  to  show  than  uniform  convergence. 

There  is  another  possible  method  for  defining  local  alternatives.  This  is  to 
ignore the underlying distributions F and $G_{(n)}$ and to define the limiting bias
function $\Delta(u)$ parametrically. A convenient representation is

(4.2.4)  $\Delta(u) = (1 - \lambda_0)\,\gamma \sum_{j=1}^{k} a_j \sin \pi j u.$

The  function  A(u)  preserves  its  known  properties:  A(u)  is  continuous  with 
A(0)  =  A(l)  =  0.  This  procedure  is  attractive  for  creating  alternatives  other 
than  location  and  scale.  As  might  be  anticipated  from  the  discussion  in  Sec¬ 
tion  3,  location  and  scale  alternatives  affect  mainly  the  first  two  components. 
Defining  alternatives  in  this  way  allows  one  to  easily  put  weight  on  the  higher 
order  components.  Alternatives  constructed  in  this  way  will  be  called  Fourier 
alternatives  since  a  sine  basis  is  used  to  define  A(u). 
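
A Fourier alternative is easy to evaluate directly. The following minimal sketch assumes the form $\Delta(u) = (1 - \lambda_0)\gamma \sum_j a_j \sin \pi j u$; the function name and argument order are hypothetical.

```python
from math import pi, sin
from typing import Sequence


def fourier_bias(u: float, gamma: float, coeffs: Sequence[float], lam0: float) -> float:
    """Evaluate (1 - lam0) * gamma * sum_j a_j * sin(pi * j * u), j = 1..k."""
    series = sum(a * sin(pi * j * u) for j, a in enumerate(coeffs, start=1))
    return (1 - lam0) * gamma * series
```

Putting all the weight on a single high-order term, for example `coeffs = [0, 0, 0, 1]`, gives an alternative that oscillates four half-periods across the unit interval while still vanishing at 0 and 1.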

At this point, it is now possible to define the asymptotic relative efficiency
(ARE) of two rank statistics. Let $R_{N1}$ be a rank statistic with score function
$J_1(u)$ and let $R_{N2}$ be a rank statistic with score function $J_2(u)$. Assume $J_1(u)$
and $J_2(u)$ are differentiable, then

$$R_{Ni} = -\int_0^1 J_i'(u)\, CD_{0N}(u)\, du
  \ \xrightarrow{d}\ -\int_0^1 J_i'(u) L(u)\, du - (1 - \lambda_0)^{-1/2} \int_0^1 J_i'(u) \Delta(u)\, du$$

as $m \wedge n \to \infty$ for $i = 1, 2$, where the convergence in distribution follows from
the weak convergence of $CD_{0N}$ to $L + (1 - \lambda_0)^{-1/2}\Delta$. Restating this result, one
has

$$\frac{R_{Ni} + (1 - \lambda_0)^{-1/2} \int_0^1 J_i'(u) \Delta(u)\, du}
       {\sqrt{\int_0^1 J_i(u)^2\, du - \left[\int_0^1 J_i(u)\, du\right]^2}}
  \ \xrightarrow{d}\ N(0, 1)$$

as $m \wedge n \to \infty$. The conditions of Noether's theorem [see Randles and Wolfe
(1979), page 147] which justify defining ARE's are clearly met. The asymptotic
relative efficiency of $R_{N1}$ to $R_{N2}$, denoted $ARE(R_{N1}, R_{N2})$, is defined as

$$ARE(R_{N1}, R_{N2}) = \left(\frac{\kappa_1}{\kappa_2}\right)^2$$

where

$$\kappa_i = \frac{(1 - \lambda_0)^{-1/2} \int_0^1 J_i'(u) \Delta(u)\, du}
                 {\sqrt{\int_0^1 J_i(u)^2\, du - \left[\int_0^1 J_i(u)\, du\right]^2}},
  \qquad i = 1, 2,$$

and $\kappa_i$ is known as the efficacy of the rank test. If $ARE(R_{N1}, R_{N2}) > 1$ then
$R_{N1}$ is asymptotically relatively efficient compared to $R_{N2}$.

The motivation for these definitions is straightforward once one realizes that
the asymptotic power function for $R_{Ni}$ is calculated as

$$\beta(\gamma) = P_\gamma[\text{Reject } H_0]$$
$$= \lim_{m \wedge n \to \infty} P\bigl[\,|Z_{Ni}| > z_{\alpha/2}\bigr]$$
$$= \lim_{m \wedge n \to \infty} 1 - P\bigl[-z_{\alpha/2} - \kappa_i < Z_{Ni} - \kappa_i < z_{\alpha/2} - \kappa_i\bigr]$$
$$= 1 - \bigl[\Phi(z_{\alpha/2} - \kappa_i) - \Phi(-z_{\alpha/2} - \kappa_i)\bigr],$$

where $\sigma_i$ is the denominator in the definition of $\kappa_i$ and $Z_{Ni} = R_{Ni}/\sigma_i$. Increasing
$\kappa_i$ causes $\Phi$ to be evaluated further in the tail, which reduces the quantity in the
brackets and increases the power function. The efficacy, $\kappa_i$, does provide an
ordering of the powers and is meaningful.
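
The dependence of power on the efficacy can be made concrete with a short calculation. The sketch below evaluates $1 - [\Phi(z_{\alpha/2} - \kappa) - \Phi(-z_{\alpha/2} - \kappa)]$ for a given efficacy and, as an assumption, takes the ARE to be the usual squared ratio of efficacies. The function names are hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def asymptotic_power(kappa: float, alpha: float = 0.05) -> float:
    """Asymptotic power of a two-sided level-alpha rank test with efficacy kappa."""
    z = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return 1 - (_STD_NORMAL.cdf(z - kappa) - _STD_NORMAL.cdf(-z - kappa))


def asymptotic_relative_efficiency(kappa_1: float, kappa_2: float) -> float:
    """ARE of test 1 to test 2, assumed here to be (kappa_1 / kappa_2) ** 2."""
    return (kappa_1 / kappa_2) ** 2
```

At $\kappa = 0$ the power equals the size $\alpha$, and it increases with $|\kappa|$, so the test with the larger efficacy has the higher asymptotic power at every value of the local parameter.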

Power  functions  and  ARE’s  are  given  in  Subsection  4.2.5  for  local  scale 
and  location  alternatives  corresponding  to  four  underlying  distributions:  normal, 
logistic,  Laplace,  and  Cauchy.  Power  functions  are  also  calculated  for  two  Fourier 
alternatives. 

Summarizing,  in  this  subsection  the  concept  of  local  alternative  was  defined. 
A  theorem  giving  the  weak  convergence  of  the  comparison  distribution  empirical 
process  was  given.  Local  location  and  scale  alternatives  were  defined  and  lemmas 
concerning  the  uniform  convergence  of  the  biases  were  proved.  Local  Fourier 
alternatives  were  defined.  The  ARE  of  two  rank  statistics  was  defined  and  the 
relevance  of  the  measure  illustrated.  In  Subsection  4.2.3,  techniques  for  actually 
calculating  the  power  curves  are  given. 

4.2.3. Computing Power Curves. Power curves are constructed for the subset
chi-square test applied to the components of the kernel density process, $\varphi$,
CVM (Cramér-von Mises statistic), and AD (Anderson-Darling statistic). Two
separate  techniques  are  used.  For  the  subset  chi-square  procedure,  simulation 
methods  are  used.  For  the  others,  the  characteristic  function  is  numerically 
inverted. 

Since  finding  the  percentage  points  of  the  subset  chi-square  test  under  H0 
required  simulation  techniques,  it  is  not  at  all  surprising  that  finding  the  power 
function  does  as  well.  The  technique  is  as  follows.  The  parameters  M  and  h  are 
given.  The  asymptotic  bias  for  each  of  the  M  components,  z'Nx,  is 

b*j  =  JQ 

where $\delta_h$ is as defined in Subsection 4.2.2. The normalized components have bias


•  • 


Although the notation does not reveal it explicitly, $\Delta$, $\delta_h$ and thus $b_j$ all depend
on the parameter $\gamma$ which indexes the local alternative [see equations (4.2.2),
(4.2.3), and (4.2.4)]. To find the power function, $\beta(\gamma)$, one needs to find the
probability of the subset chi-square test rejecting H0 when given M independent
normal random variables with variance 1 and means $b_j$, $j = 1, \ldots, M$. The
beauty of this procedure is that one need not take large samples from the
underlying distributions F and $G_{(n)}$, compute the components and then apply the
subset chi-square test. One need only simulate the limiting distribution of the
M components to obtain the asymptotic power function.

The  simulation  is  conducted  in  the  same  manner  as  that  which  generated 
Figures  20  and  21.  For  each  set  of  M  iid  N(0,1)  realizations,  an  indicator  function 
is set to 1 for rejection and 0 otherwise at each $\gamma = \gamma_i = (i - 1)/35$ for $i =
1, \ldots, 36$. These individual functions are then averaged over 10,000 realizations
to arrive at the estimated power function. A confidence interval for any point
along the estimated function having at least a 95% coverage probability has a
half-width of

$$z_{0.975} \sqrt{\frac{1}{4 \cdot 10{,}000}} = 0.0098.$$
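
The simulation scheme can be sketched as follows. Two assumptions are made loudly here. First, the subset chi-square test and its critical sequence are defined elsewhere, so an ordinary sum-of-squares statistic with a simulated null critical value is swapped in as a stand-in. Second, the component means are taken to be linear in $\gamma$, $b_j(\gamma) = \gamma\, b_j(1)$, which holds when $\Delta$ is linear in $\gamma$ as in (4.2.2) to (4.2.4). All names are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist
from typing import List, Sequence


def null_critical_value(m: int, alpha: float, reps: int, rng: random.Random) -> float:
    """Simulated (1 - alpha) quantile of the stand-in statistic under H0."""
    stats = sorted(sum(rng.gauss(0, 1) ** 2 for _ in range(m)) for _ in range(reps))
    return stats[min(reps - 1, int((1 - alpha) * reps))]


def simulate_power(unit_bias: Sequence[float], gammas: Sequence[float],
                   alpha: float = 0.05, reps: int = 10_000, seed: int = 0) -> List[float]:
    """Estimate the power at each gamma, reusing every N(0,1) draw across gammas."""
    rng = random.Random(seed)
    m = len(unit_bias)
    crit = null_critical_value(m, alpha, reps, rng)
    rejections = [0] * len(gammas)
    for _ in range(reps):
        z = [rng.gauss(0, 1) for _ in range(m)]  # one set of M iid N(0,1) draws
        for i, g in enumerate(gammas):
            stat = sum((zj + g * bj) ** 2 for zj, bj in zip(z, unit_bias))
            if stat > crit:  # indicator of rejection at this gamma
                rejections[i] += 1
    return [r / reps for r in rejections]


def conservative_half_width(reps: int, level: float = 0.95) -> float:
    """Half-width using the worst-case Bernoulli variance p(1 - p) <= 1/4."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return z * sqrt(1 / (4 * reps))
```

A grid such as `[(i - 1) / 35 for i in range(1, 37)]` reproduces the 36 evaluation points, and `conservative_half_width(10_000)` recovers the half-width quoted above. Reusing the same normal draws at every $\gamma$ keeps the estimated curve smooth from point to point.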


There are numerical methods of approximating the power functions of $\varphi$,
CVM,  and  AD  so  one  needn’t  resort  to  simulation  methods.  Each  of  these  statis¬ 
tics  is  representable  as  a  weighted  infinite  sum  of  squares  of  independent  normal 
random  variables  under  H0  and  local  alternatives.  Under  H0  these  normal  ran¬ 
dom  variables  are  iid  N(0,1);  under  local  alternatives  they  have  nonzero  means. 
Since  the  weights  on  the  squared  normals  decrease  very  rapidly,  these  numerical 
methods  truncate  the  infinite  series  at  some  point  Q.  The  distribution  of 


t = f;  tjZ} 


j= 1 


}  =  1 


is  approximated  by 
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Reflect  back  to  the  discussion  of  the  space  spanned  by  the  eigenfunctions  in  Sec¬ 
tion  3.  The  distribution  of  <p ^  is  to  be  approximated  by  the  distribution  of  the 
projection  of  the  process  KDPo^y  ^  onto  a  subspace  spanned  by  the  eigenfunc¬ 
tions.  For  the  purposes  of  the  approximation,  it  is  clear  that  whether  or  not  the 
eigenfunctions  form  a  complete  basis  for  Sp  is  irrelevant. 

Durbin and Knott (1972) take the approach described here in finding the
distribution of various elements of the CVM statistic. They do add one more
term, $aX$, where $X$ is distributed as $\chi^2_\nu$ and is independent of $Z_1, \ldots, Z_Q$. They
choose $a$ and $\nu$ so that $T$ and $T_Q$ have the same mean and variance.

The approach taken by Durbin and Knott is adopted here. They invert
the characteristic function of $T_Q$ by numerical methods. The methods used
by Durbin and Knott were originally proposed by Imhof (1961) and Slepian
(1958). These methods are geared specifically to quadratic forms of normal
random variables. They return the distribution function of $T_Q$, $F(x)$, given the
$\theta_j$’s. They are somewhat tedious in that one must perform numerical integration
for each $x$ for which $F(x)$ is desired. If the entire distribution function is needed,
this can result in quite a lot of computation. A different method is used here;
one that applies to a much broader range of cases than the methods of Imhof
and Slepian. This new method returns the density function at a range of values,
not just at a single value.

This  method  uses  the  fast  Fourier  transform  (FFT)  to  numerically  invert 
the  characteristic  function.  As  obvious  as  this  idea  sounds,  it  doesn’t  appear 
in  the  literature  in  this  form.  Silverman  (1982)  [see  also  Jones  and  Lotwick 
(1984)]  describes  an  algorithm  which  uses  the  FFT  to  numerically  invert  the 
characteristic function of a kernel density estimate. Otherwise, the FFT has not
been used in this manner.

To describe the algorithm, start by assuming $f(x)$ is a continuous density
function and that $\phi_X(t)$ is its characteristic function. Then one has [see Parzen
(1962b), page 12]

$$f(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-itx}\phi_X(t)\,dt \approx \frac{1}{2\pi}\int_{-M}^{M} e^{-itx}\phi_X(t)\,dt \approx \frac{1}{2\pi}\,\frac{2M}{N}\sum_{j=0}^{N-1} e^{-ix(-M+2Mj/N)}\,\phi_X(-M+2Mj/N)$$

(4.2.5) $$= \frac{M}{\pi N}\,e^{iMx}\sum_{j=0}^{N-1} e^{-2ixMj/N}\,Z(j),$$

where $Z(j) = \phi_X(-M + 2Mj/N)$. The integers $M$ and $N$ are unrelated to their
earlier uses. There are two sources of error here, that due to truncation and that
due to approximation.

Consider the relation of (4.2.5) to the inverse FFT of $Z(0), \ldots, Z(N-1)$.
The inverse FFT is

(4.2.6) $$I(k) = \sum_{j=0}^{N-1} e^{-2\pi i jk/N}\,Z(j),$$

for $k = 0, \ldots, [N/2]$. Comparing (4.2.5) and (4.2.6), one sees that they are almost
the same. Equating the exponents of the two exponentials,

$$\frac{2xMj}{N} = \frac{2\pi jk}{N}, \qquad x = \frac{\pi k}{M},$$

for $k = 0, \ldots, [N/2]$. This defines the $x$ values at which an approximation results.
Substituting these $x$ values into $e^{iMx}$, one finds that

$$e^{iMx} = e^{iM\pi k/M} = e^{i\pi k} = \begin{cases} 1, & k \text{ even}, \\ -1, & k \text{ odd}. \end{cases}$$

The estimate $\hat f_{NM}$ of $f(x)$ at the points $x_k = \pi k/M$, $k = 0, \ldots, [N/2]$ is

$$\hat f_{NM}(x_k) = (-1)^k\,\frac{M}{\pi N}\,\mathrm{Re}[I(k)].$$

The estimate of the distribution function $F$, call it $\hat F_{NM}$, can then be found from
$\hat f_{NM}$ by numerical integration. Using the trapezoidal rule, the estimate works
out to be

(4.2.7) $$\hat F_{NM}(x_k) = \frac{\pi}{M}\left[\sum_{k'=1}^{k-1} \hat f_{NM}(x_{k'}) + \frac{\hat f_{NM}(x_0) + \hat f_{NM}(x_k)}{2}\right].$$

In this work, it is known that the distributions have support on the positive reals
so that getting $f(x)$ at $x = \pi k/M$ is not a difficulty. In other work, one could
use the forward FFT to get the density at $x = -\pi k/M$. Or one could use the
characteristic function of $X \pm b$ to slide the areas of interest within the range of
the $x_k$ values.
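The steps in (4.2.5) through (4.2.7) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the author's program: the small radix-2 `fft` routine, the helper names `invert_cf` and `cdf_trapezoid`, the chi-square (4 df) test case, and the choices M = 50 and N = 2048 are all introduced here for illustration only.

```python
import cmath
import math


def fft(a):
    """Radix-2 Cooley-Tukey transform: out[k] = sum_j a[j] * exp(-2*pi*i*j*k/n).

    The length of a must be a power of two. The sign convention matches (4.2.6).
    """
    n = len(a)
    if n == 1:
        return list(a)
    even, odd = fft(a[0::2]), fft(a[1::2])
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out


def invert_cf(cf, M, N):
    """Return the grid x_k = pi*k/M and density estimates, k = 0..N/2."""
    Z = [cf(-M + 2.0 * M * j / N) for j in range(N)]   # Z(j) of (4.2.5)
    I = fft(Z)                                          # I(k) of (4.2.6)
    ks = range(N // 2 + 1)
    xs = [math.pi * k / M for k in ks]
    fs = [(-1) ** k * (M / (math.pi * N)) * I[k].real for k in ks]
    return xs, fs


def cdf_trapezoid(xs, fs):
    """Cumulative trapezoidal rule on the x_k grid, as in (4.2.7)."""
    F = [0.0]
    for k in range(1, len(xs)):
        F.append(F[-1] + 0.5 * (fs[k - 1] + fs[k]) * (xs[k] - xs[k - 1]))
    return F


# Illustrative target (an assumption, not a case from the text):
# chi-square with 4 df, cf (1 - 2it)^(-2), density (x/4) * exp(-x/2).
def chi2_4_cf(t):
    return (1 - 2j * t) ** -2


xs, fs = invert_cf(chi2_4_cf, M=50.0, N=2048)
Fs = cdf_trapezoid(xs, fs)
```

One transform returns the density on the whole grid at once, which is the advantage claimed over methods that integrate numerically for each x separately.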

The approximation used in (4.2.5) appears to be the composite rectangular
rule, but it is not. Because of the symmetries of the characteristic function,
(4.2.5) is equivalent to the trapezoidal rule. Let

$$g(t,x) = \mathrm{Re}[e^{-itx}\phi_X(t)] = \cos tx \cdot \mathrm{Re}[\phi_X(t)] + \sin tx \cdot \mathrm{Im}[\phi_X(t)],$$

so that $f(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} g(t,x)\,dt$. Note that $g(-t,x) = g(t,x)$ for all $x$ and $t$. The
trapezoidal rule for integrating $g(t,x)$ with respect to $t$ is

$$\frac{M}{\pi N}\left[\sum_{j=0}^{N} g(-M + 2Mj/N, x) - \frac{g(-M,x) + g(M,x)}{2}\right] = \frac{M}{\pi N}\sum_{j=0}^{N-1} g(-M + 2Mj/N, x),$$

since $g(-M,x) = g(M,x)$. This last sum is precisely (4.2.5).

The  errors  in  equations  (4.2.5)  and  (4.2.7)  would  seem  to  be  working  against 
each  other.  To  make  (4.2.5)  more  accurate  one  wants  M/N  small;  to  make 
(4.2.7)  more  accurate  one  wants  M  large  which  means  N  may  need  to  be  huge 
to  make  M/N  small.  One  must  also  select  M  large  enough  so  that  truncation 
errors  in  the  original  integral  defining  /  don’t  dominate.  These  forces  can  be 
balanced, however. Theorem 4.2.2 gives a numerical consistency result for the
approximation $\hat F_{NM}$.


Theorem 4.2.2. Let $\phi_X(t)$ be the characteristic function of $f$ with support on the
positive reals and $g(t,x)$ be as defined above. Suppose that $g(t,x)$ is twice differentiable
in $t$ (except possibly at $t = 0$, in which case the left and right derivatives
must exist) and that

$$\sup_{t \ne 0,\; x \in [0,b]} \left|\frac{\partial^2}{\partial t^2}\,g(t,x)\right| < \infty$$

and

$$\int_{-\infty}^{\infty} \left|t^2\,\mathrm{Re}[\phi_X(t)]\right| dt < \infty, \qquad \int_{-\infty}^{\infty} \left|t^2\,\mathrm{Im}[\phi_X(t)]\right| dt < \infty.$$

Suppose also that $M^3/N^2 \to 0$ as $M, N \to \infty$, then

$$\hat F_{NM}(b) \to F(b)$$

as $N, M \to \infty$ for $b > 0$.

Theorem  4.2.2  says  not  only  that  the  estimate  of  the  distribution  function 
converges  but  also  gives  a  bound  on  the  relation  of  M  to  N.  Rates  of  convergence 
are  harder  to  derive.  They  are  probably  less  useful  too  since  such  rates  are  upper 
bounds and usually not very good ones. In applying this procedure, one typically
observes that if $\hat f_{NM}(x_k) < f(x_k)$ then $\hat f_{NM}(x_{k+1}) > f(x_{k+1})$. The trapezoidal
rule applied to $\hat f_{NM}$ gives a result remarkably close to that which would result if
$f$ had been used. The procedure seems quite robust to the choice of $N$ and $M$.
However, if really bad choices are made the result is usually quite apparent. The
estimated density $\hat f_{NM}$ is very wild looking and doesn’t come close to integrating
to 1.

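To test this procedure, consider using it to invert the characteristic function
of the CVM statistic under $H_0$. In Lemma 4.2.2, the characteristic function
$\phi_Q(t)$ of $T_Q = \sum_{j=1}^{Q} \theta_j Z_j^2$ is given.

Lemma 4.2.2. Let $Z_1, \ldots, Z_Q$ be independent with $Z_j \sim N(b_j, 1)$. Let $\theta_1, \ldots, \theta_Q$
be a sequence of constants. Then the characteristic function of

$$T_Q = \sum_{j=1}^{Q} \theta_j Z_j^2$$

is

(4.2.8) $$\phi_Q(t) = \prod_{j=1}^{Q} \left(\frac{1}{1 - 2it\theta_j}\right)^{1/2} \exp\!\left(\frac{c_j\,it}{1 - 2it\theta_j}\right),$$

where $c_j = \theta_j b_j^2$.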

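Equation (4.2.8) is not in a useful form. Through purely formal manipulations,
one arrives at

$$\frac{1}{1 - 2it\theta_j} = \frac{1}{1 + 4\theta_j^2 t^2} + i\,\frac{2\theta_j t}{1 + 4\theta_j^2 t^2}.$$

One also needs the square root of $a + bi$ which is $c + di$, where

$$d = \operatorname{sgn}(b)\sqrt{\frac{\sqrt{a^2 + b^2} - a}{2}}, \qquad c = \sqrt{\frac{\sqrt{a^2 + b^2} + a}{2}}.$$

At this point, equations (4.2.6) and (4.2.7) can be implemented on a computer.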
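A minimal sketch of evaluating the characteristic function of Lemma 4.2.2 through the real and imaginary split described above. The function name `cf_weighted_chisq` is illustrative, and the weights $\theta_j = 1/(j\pi)^2$ are the standard Cramér-von Mises weights, assumed here because they are not restated in this subsection.

```python
import cmath
import math


def cf_weighted_chisq(t, thetas, means=None):
    """phi_Q(t) for T_Q = sum_j theta_j * Z_j**2 with Z_j ~ N(b_j, 1)."""
    if means is None:
        means = [0.0] * len(thetas)       # null hypothesis: all means zero
    value = 1.0 + 0.0j
    for theta, b in zip(thetas, means):
        denom = 1.0 + 4.0 * theta * theta * t * t
        # 1/(1 - 2it*theta) written as real part + i * imaginary part.
        inv = complex(1.0 / denom, 2.0 * theta * t / denom)
        # The real part of inv is positive, so the principal square root
        # of each factor varies continuously in t.
        value *= cmath.sqrt(inv) * cmath.exp(1j * t * theta * b * b * inv)
    return value


# Assumed Cramer-von Mises weights, truncated at Q = 20 terms.
Q = 20
cvm_thetas = [1.0 / (j * math.pi) ** 2 for j in range(1, Q + 1)]
phi_at_zero = cf_weighted_chisq(0.0, cvm_thetas)
phi_at_five = cf_weighted_chisq(5.0, cvm_thetas)
```

Taking the square root factor by factor, rather than of the whole product, avoids branch jumps that would otherwise appear once the argument of the product passes $\pi$.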

The procedure is run twice, once with a truncation point of $Q = 20$ and
once with $Q = 32$. In each case, one more term is added so that the mean of the
truncated sum is the same as that of the infinite sum. The results are given in
Table 7 and are compared to Anderson and Darling’s (1952) values. The values
are compared in the quantile domain, not the distribution domain. They are
compared in this domain since people actually using the test will want a critical
value from the quantile. The large maximum percentage difference observed
in Table 7 occurs near the lower endpoint of the distribution where the values
of the quantiles are near zero. Larger percentage errors can be forgiven here.
The absolute error is small throughout and the maximum percentage error for
$u > 0.25$ is extremely good.

The procedure has been found to work less well on densities with singular
points or large discontinuities. That it should work less well with densities
with singularities is not surprising. Most of the simple numerical integration
techniques will fail in such cases. The problem with discontinuities comes in
inverting the characteristic function. If $f(x)$ is discontinuous at $x = a$, the inversion
routine wants to return a value of $[f(a^-) + f(a^+)]/2$ at $x = a$. If the
discontinuity is large, this tends to cause the next integration routine to underestimate
$F$. If the point of discontinuity is known and $f(a^-) = 0$, one can always
double 


f 
Table 7

FFT approximation to the quantile function of the Cramér-von Mises
statistic under H0 compared to Anderson and Darling’s (1952) values.
Unless otherwise stated maxima are taken at a grid of u’s between 0.01
and 0.999.

                                            No. of Terms (Q)
                                             20          32

max |Q_NM(u) - Q(u)|                       0.0029      0.0019
max |Q_NM(u) - Q(u)| / Q(u)                11.7%       7.6%
max |Q_NM(u) - Q(u)|, u > 0.25             0.0011      0.00025
max |Q_NM(u) - Q(u)| / Q(u), u > 0.25      0.18%       0.10%
M                                          400         400
N                                          2048        2048
Q(0.01)                                    0.0248      0.0248
Q(0.999)                                   1.1679      1.1679

The  FFT  method  seems  a  very  good  contender  to  existing  techniques  both 
in  terms  of  accuracy  and  speed.  This  method  is  also  applicable  to  a  far  greater 
number  of  cases  than  the  methods  of  Imhof  and  Slepian. 

4.2.4.  Checking  the  Conditions  of  Theorem  4.2.1.  This  subsection  looks 
at  the  details  that  are  involved  in  showing  the  conditions  of  Theorem  4.2.1  are 
met.  These  are  not  shown  for  all  four  distributions  that  are  being  worked  with. 
The  steps  are  very  similar  for  each  and  somewhat  tedious.  Since  the  Cauchy 
distribution  is  widely  used  as  the  exception  to  statistical  rules,  these  conditions 
are  shown  for  Cauchy  location  and  scale  alternatives. 

First one needs to find a sequence of constants, $\{a_{(N)}\}$, such that

$$P[-a_{(N)} < X_1, \ldots, X_m, Y_{n1}, \ldots, Y_{nn} < a_{(N)}] \to 1$$

as $m \wedge n \to \infty$. This is equivalent to the condition bounding the sample quantile
function. Since $X_1, \ldots, X_m$ are iid $F$ and $Y_{n1}, \ldots, Y_{nn}$ are iid $F(x - \gamma/\sqrt{n})$ and
the two samples are independent, it follows that:

(4.2.9) $$P[-a_{(N)} < X_1, \ldots, X_m, Y_{n1}, \ldots, Y_{nn} < a_{(N)}]$$
$$= [F(a_{(N)}) - F(-a_{(N)})]^m\,[F(a_{(N)} - \gamma/\sqrt{n}) - F(-a_{(N)} - \gamma/\sqrt{n})]^n$$
$$= \left[\tfrac{1}{\pi}\tan^{-1}(a_{(N)}) - \tfrac{1}{\pi}\tan^{-1}(-a_{(N)})\right]^m \cdot \left[\tfrac{1}{\pi}\tan^{-1}(a_{(N)} - \gamma/\sqrt{n}) - \tfrac{1}{\pi}\tan^{-1}(-a_{(N)} - \gamma/\sqrt{n})\right]^n$$
$$= \left[\tfrac{1}{\pi}\tan^{-1}(a_{(N)} - \gamma/\sqrt{n}) + \tfrac{1}{\pi}\tan^{-1}(a_{(N)} + \gamma/\sqrt{n})\right]^n \cdot \left[\tfrac{2}{\pi}\tan^{-1} a_{(N)}\right]^m,$$

since $F(x) = 1/2 + (1/\pi)\tan^{-1} x$. Abramowitz and Stegun (1964), page 81, give
the following series representation for $\tan^{-1}$:


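(4.2.10) $$\tan^{-1} x = \frac{\pi}{2} - \frac{1}{x} + \sum_{j=1}^{\infty} \frac{(-1)^{j+1}}{(2j+1)\,x^{2j+1}}, \qquad x > 1.$$

Thus

$$P[-a_{(N)} < X_1, \ldots, X_m, Y_{n1}, \ldots, Y_{nn} < a_{(N)}] = \left[1 - \frac{2}{\pi a_{(N)}} + o(a_{(N)}^{-1})\right]^m \left[1 - \frac{2}{\pi a_{(N)}} + o(a_{(N)}^{-1})\right]^n,$$

since

$$\frac{1}{a_{(N)} - \gamma/\sqrt{n}} + \frac{1}{a_{(N)} + \gamma/\sqrt{n}} = \frac{2}{a_{(N)}} + o(a_{(N)}^{-1}).$$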

Using the fact that $x/(x+1) < \ln(1+x) < x$ for $x > 0$ one can show that if
$a_{(N)} = N^{1+\epsilon}$ then

$$\left[1 - \frac{2}{\pi a_{(N)}} + o(a_{(N)}^{-1})\right]^{N} \to 1$$

as $N \to \infty$. Let $a_{(N)} = N^2$. Next it must be shown that $Q^{H_{(n)}}(1 - 1/N) \le N^2$.
Of course, $Q^{H_{(n)}}(1 - 1/N) \le N^2$ if and only if $H_{(n)}(N^2) \ge 1 - 1/N$. Since
$F(x - \gamma/\sqrt{n}) \le F(x)$, it is sufficient to check that $F(N^2 - \gamma/\sqrt{n}) \ge 1 - 1/N$.
It can be shown by induction that each partial sum in the series in (4.2.10) is
nonnegative which means that

$$\tan^{-1} x \ge \frac{\pi}{2} - \frac{1}{x}, \qquad x \ge 1,$$

and thus

$$F(N^2 - \gamma/\sqrt{n}) \ge 1 - \frac{1}{\pi(N^2 - \gamma/\sqrt{n})} \ge 1 - \frac{1}{N}$$
for $N$ sufficiently large. Next the conditions on the sup and inf of the likelihood
ratio on the range $[-a_{(N)}, a_{(N)}]$ must be checked. The likelihood ratio is

$$\frac{f(x)}{f(x - \gamma/\sqrt{n})} = \frac{1 + (x - \gamma/\sqrt{n})^2}{1 + x^2},$$

which has extrema at

$$x^* = \tfrac{1}{2}\left(\gamma/\sqrt{n} \pm \sqrt{\gamma^2/n + 4}\right).$$

It is quite clear that $f(x^*)/f(x^* - \gamma/\sqrt{n}) \to 1$ as $m \wedge n \to \infty$. The endpoints of
the range should be examined too, since calculus-based methods might miss the
endpoints misbehaving:

$$\frac{f(N^2)}{f(N^2 - \gamma/\sqrt{n})} = \frac{1 + (N^2 - \gamma/\sqrt{n})^2}{1 + N^4} \to 1,$$

as $m \wedge n \to \infty$. Therefore, all the conditions are met for the Cauchy location
alternative.
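The limits claimed for the Cauchy location alternative can be checked numerically. The sketch below is illustrative only: it assumes that N denotes the pooled sample size m + n, takes equal sample sizes and $\gamma = 1$, and uses function names that are not from the original work.

```python
import math


def cauchy_cdf(x):
    """Standard Cauchy distribution function F(x) = 1/2 + atan(x)/pi."""
    return 0.5 + math.atan(x) / math.pi


def coverage_probability(m, n, gamma, a):
    """Right-hand side of (4.2.9): all m + n observations fall in (-a, a)."""
    shift = gamma / math.sqrt(n)
    p_x = cauchy_cdf(a) - cauchy_cdf(-a)
    p_y = cauchy_cdf(a - shift) - cauchy_cdf(-a - shift)
    return p_x ** m * p_y ** n


def likelihood_ratio(x, gamma, n):
    """f(x) / f(x - gamma/sqrt(n)) for the standard Cauchy density."""
    shift = gamma / math.sqrt(n)
    return (1.0 + (x - shift) ** 2) / (1.0 + x * x)


def interior_extrema(gamma, n):
    """Stationary points x* = (d -/+ sqrt(d**2 + 4)) / 2 with d = gamma/sqrt(n)."""
    shift = gamma / math.sqrt(n)
    root = math.sqrt(shift * shift + 4.0)
    return 0.5 * (shift - root), 0.5 * (shift + root)


gamma = 1.0
for size in (10, 100, 1000):
    m = n = size
    N = m + n                 # assumption: N is the pooled sample size
    a_N = float(N) ** 2       # the choice a_(N) = N^2 made in the text
    lo, hi = interior_extrema(gamma, n)
    print(N, coverage_probability(m, n, gamma, a_N),
          likelihood_ratio(lo, gamma, n), likelihood_ratio(hi, gamma, n),
          likelihood_ratio(a_N, gamma, n))
```

As the sample size grows, the printed coverage probability and the three likelihood ratio values should all move toward 1, in line with the argument above.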

Showing that the sample quantile is bounded for local scale alternatives
proceeds in a perfectly analogous fashion. Likewise, showing that
$F(N^2/|1 + \gamma/\sqrt{n}|) \ge 1 - 1/N$ is carried out in the same fashion. This leaves checking the
sup and inf of the likelihood ratio. The likelihood ratio for the scale alternative
is

$$\frac{f(x)}{f(x/|1 + \gamma/\sqrt{n}|)} = \frac{1 + (x/|1 + \gamma/\sqrt{n}|)^2}{1 + x^2}.$$

This ratio has an extremum at $x = 0$ and $f(0)/f(0) = 1$. Again, the tails must
also be checked:

$$\frac{f(N^2)}{f(N^2/|1 + \gamma/\sqrt{n}|)} = \frac{1 + N^4/(1 + \gamma/\sqrt{n})^2}{1 + N^4} \to 1,$$

as $m \wedge n \to \infty$. Therefore, all the conditions are met for the Cauchy scale
alternative.

It has been shown that the conditions for the weak convergence of the empirical
comparison distribution process are met for Cauchy location and scale
alternatives. The demonstrations for other distributions follow in a completely
analogous manner. In the next subsection, power curves and ARE’s are found.

4.2.5. Power Curves and Asymptotic Relative Efficiencies. Power curves for
local location and scale alternatives for the normal and Cauchy distributions are
given in this subsection. Power curves are also calculated for two Fourier alternatives.
The subset chi-square procedure is seen to perform well, particularly as the
alternative stresses higher components. An investigation is made on the effect of
the choice of bandwidth on the subset chi-square test. A similar investigation is
conducted for $\varphi_h^2$. The asymptotic relative efficiencies of the first two components
to standard rank statistics are found for location and scale alternatives for four
underlying distributions: normal, Cauchy, logistic, and Laplace. The efficiency
of the components is seen to vary with the bandwidth. For the distributions
considered, a larger bandwidth tends to do better for location alternatives and
a smaller one better for scale alternatives.

For the subset chi-square test, a cutoff of 0.001 is used for including components
in the test. Thus, for h = 0.5, 0.4, 0.3, 0.2, and 0.1, a truncation point
of M = 4, 6, 8, 12, and 23 is used, respectively (cf. Table 6). Table 6 indicates
that M = 5 should be used for h = 0.5, but it was not. The eigenvalue
for the fifth component in this case is 0.002, so its exclusion should not be
significant. All power curves are derived for $\lambda_0 = 0.5$ and $0 \le \gamma \le 7$.

Figures 26 and 27 present power curves for normal and Cauchy location
alternatives, respectively. For the normal case, the techniques arrange themselves
as follows from highest to lowest power: AD, CVM, $\varphi_{0.5}^2$, subset chi-square:
h = 0.5, 0.3, 0.2. There is a gap between the top and bottom three. This is not
unexpected considering previous remarks on the behavior of these statistics. The
components are down-weighted at such a rate that the first few dominate. The
normal location alternative affects mainly the first component.


Fig. 26. Power of the subset chi-square, $\varphi_{0.5}^2$, CVM, and AD tests against normal
location alternatives. The solid line is CVM; the +’s are AD; the x’s are $\varphi_{0.5}^2$; the
solid line with sparse blocks is the subset chi-square, h = 0.5; the solid line with
dense blocks is the subset chi-square, h = 0.3; the blocks are the subset chi-square,
h = 0.2.


Fig. 27. Power of the subset chi-square, $\varphi_{0.5}^2$, CVM, and AD tests against Cauchy
location alternatives. The solid line is CVM; the +’s are AD; the x’s are $\varphi_{0.5}^2$; the
solid line with sparse blocks is the subset chi-square, h = 0.5; the solid line with
dense blocks is the subset chi-square, h = 0.3; the blocks are the subset chi-square,
h = 0.2.




Table 8 presents the efficacies of the components for normal and Cauchy location
and scale alternatives. One can see that for the normal location alternative,
the first component has the largest efficacy.

For the Cauchy location alternative, the power functions are much closer.
In fact, $\varphi_{0.5}^2$ does worse than the subset chi-squares. Referring to Table 8 again,
one sees that the Cauchy location alternative places the most weight on the
third component and about half as much on the first. This is unusual for a
location alternative, yet recall from Figure 1(c) that the comparison density for
this case is not monotone. For a bandwidth of h = 0.5, the third component is
downweighted severely in $\varphi_{0.5}^2$ (cf. Figure 23), hence its poor performance. It
will also be seen later in the subsection that the components making up CVM
are more efficient against this alternative than those making up $\varphi_{0.5}^2$.

The situation changes even further for scale alternatives. Figures 28 and 29
give the power functions for normal and Cauchy scale alternatives, respectively.
Table 8 verifies that these alternatives principally affect the second component.
The first component has no influence at all and so the statistics CVM, AD,
and $\varphi_{0.5}^2$ drop off. In both these cases the subset chi-square (h = 0.5) is most
powerful. The $\varphi_{0.5}^2$ statistic is next followed by the two subset chi-squares (h =
0.3, 0.1). The $\varphi_{0.5}^2$ statistic shows relative improvement from the Cauchy location
alternative for two reasons. First, the second component receives more weight
than the third component which dominated the Cauchy location alternative.
Second, as shall be seen, the second component performs much better against
both these alternatives than the first does against Cauchy location alternatives.

Figure 30 presents the power curves for what shall be called Fourier alternative 1.
Referring to equation (4.2.4), Fourier alternative 1 is defined by $k = 5$
and $a' = 2.5 \cdot (-.4, -.5, .6, 1, 1)/30$. These coefficients are chosen so that the
Wilcoxon, median, and Mood tests all have power equal to their size. That is,
these tests are no better than one which randomly rejects $H_0$ $100\alpha$ percent of
the time. From Table 9 it can be seen that this alternative affects mainly the
third, fourth, and fifth components. For all but the largest bandwidths, the
first  component  is  involved  as  well.  The  significance  of  the  weights  used  by  the 
Table 8

Efficacies of the components of the kernel density process, normal and
Cauchy location and scale alternatives, $\lambda_0 = 0.5$ and $\gamma = 1$.

                              Bandwidth
Component      0.1      0.2      0.3      0.4      0.5

Normal Location
    1         0.621    0.672    0.685    0.691    0.692
    2         0.000    0.000    0.000    0.000    0.000
    3         0.276    0.180    0.141    0.102    0.058
    4         0.000    0.000    0.000    0.000    0.000
    5         0.138    0.094    0.022    0.053    0.094
    6         0.000    0.000    0.000    0.000    0.000
    7         0.094    0.023    0.050    0.072    0.061
    8         0.000    0.000    0.000    0.001    0.000

Normal Scale
    1         0.000    0.000    0.000    0.000    0.000
    2         0.967    0.932    0.899    0.871    0.848
    3         0.000    0.000    0.000    0.000    0.000
    4         0.015    0.093    0.143    0.199    0.266
    5         0.000    0.000    0.000    0.000    0.000
    6         0.016    0.063    0.173    0.240    0.337
    7         0.000    0.000    0.000    0.000    0.000
    8         0.018    0.109    0.151    0.198    0.142

Cauchy Location
    1         0.149    0.220    0.249    0.273    0.300
    2         0.000    0.000    0.000    0.000    0.000
    3         0.475    0.447    0.433    0.418    0.399
    4         0.000    0.000    0.000    0.000    0.000
    5         0.040    0.035    0.007    0.021    0.010
    6         0.000    0.000    0.000    0.000    0.000
    7         0.017    0.003    0.006    0.011    0.002
    8         0.000    0.000    0.000    0.000    0.000

Cauchy Scale
    1         0.000    0.000    0.000    0.000    0.000
    2         0.374    0.426    0.448    0.464    0.480
    3         0.000    0.000    0.000    0.000    0.000
    4         0.302    0.249    0.220    0.185    0.139
    5         0.000    0.000    0.000    0.000    0.000
    6         0.108    0.075    0.017    0.014    0.000
    7         0.000    0.000    0.000    0.000    0.000
    8         0.064    0.013    0.011    0.000    0.002
Fig. 28. Power of the subset chi-square, $\varphi_{0.5}^2$, CVM, and AD tests against normal
scale alternatives. The solid line is CVM; the +’s are AD; the x’s are $\varphi_{0.5}^2$; the
solid line with sparse blocks is the subset chi-square, h = 0.5; the solid line with
dense blocks is the subset chi-square, h = 0.3; the blocks are the subset chi-square,
h = 0.2.

Fig. 29. Power of the subset chi-square, $\varphi_{0.5}^2$, CVM, and AD tests against Cauchy
scale alternatives. The solid line is CVM; the +’s are AD; the x’s are $\varphi_{0.5}^2$; the
solid line with sparse blocks is the subset chi-square, h = 0.5; the solid line with
dense blocks is the subset chi-square, h = 0.3; the blocks are the subset chi-square,
h = 0.2.

Fig. 30. Power of the subset chi-square, $\varphi_{0.3}^2$, CVM, and AD tests against Fourier
alternative 1. The solid line is CVM; the +’s are AD; the x’s are $\varphi_{0.3}^2$; the solid
line with sparse blocks is the subset chi-square, h = 0.5; the solid line with dense
blocks is the subset chi-square, h = 0.3; the blocks are the subset chi-square,
h = 0.2.

Table 9

Efficacies of the components of the kernel density process, Fourier
alternatives, $\lambda_0 = 0.5$ and $\gamma = 1$.

                              Bandwidth
Component      0.1      0.2      0.3      0.4      0.5

Fourier Alternative 1
    1         0.272    0.245    0.216    0.138    0.062
    2         0.118    0.090    0.087    0.068    0.016
    3         0.320    0.360    0.395    0.412    0.366
    4         0.467    0.484    0.489    0.510    0.526
    5         0.425    0.441    0.452    0.514    0.564
    6         0.000    0.000    0.000    0.000    0.000
    7         0.000    0.000    0.000    0.000    0.000
    8         0.000    0.000    0.000    0.000    0.000

Fourier Alternative 2
    1         0.047    0.039    0.053    0.099    0.073
    2         0.008    0.019    0.026    0.087    0.063
    3         0.165    0.160    0.180    0.235    0.228
    4         0.317    0.308    0.278    0.343    0.458
    5         0.225    0.263    0.319    0.165    0.004
    6         0.746    0.731    0.737    0.746    0.658
    7         0.681    0.628    0.667    0.682    0.715
    8         0.000    0.000    0.000    0.000    0.000
Fig. 31. Power of the subset chi-square, $\varphi_{0.3}^2$, CVM, and AD tests against Fourier
alternative 2. The solid line is CVM; the +’s are AD; the x’s are $\varphi_{0.3}^2$; the solid
line with sparse blocks is the subset chi-square, h = 0.5; the solid line with dense
blocks is the subset chi-square, h = 0.3; the blocks are the subset chi-square,
h = 0.2.

statistics CVM and AD is beginning to become clear. The subset chi-square
tests for each bandwidth do substantially better than the traditional statistics
CVM and AD. The $\varphi_{0.3}^2$ statistic does improve on these two considerably, but
its first component does have a fair sized efficacy for this alternative. Note that
the power of the subset chi-square test no longer decreases with the bandwidth.
The ordering is h = 0.3, 0.1, 0.5.

Figure 31 presents what shall be called Fourier alternative 2. Again, in
reference to equation (4.2.4), this alternative is defined by $k = 7$ and
$a' = (.1978, .3208, -.9395, -1.308, -.1373, 1, 1)/15$. These coefficients are chosen so
that the Wilcoxon, median, normal scores (location), Mood, and normal scores
(scale) tests all have power equal to their size. From Table 9, it can be seen that
this alternative affects mainly the sixth and seventh components. This case is
even more extreme than the last. The subset chi-square with h = 0.3 and 0.1
do very well. The CVM, AD, and $\varphi_{0.3}^2$ statistics perform uniformly poorly. The
subset chi-square with h = 0.5 is between these two sets. The ordering of the
power of the subset chi-square test by bandwidth is the same as for Figure 30.
At this point a word about the effect of the bandwidth on the power of
the subset chi-square test is in order. For scale and location alternatives, the
order is according to decreasing bandwidth. As the alternatives move to higher
components, this order changes. Including more components that don’t have
much (any) efficacy reduces the power of the test. For location and scale alternatives,
only the first two or three components are important. Reducing the
bandwidth adds components to the decision process which carry little or no signal
(efficacy). This translates to a reduction in power. As the alternative moves
to higher components, the larger bandwidth excludes components that carry the
signal. That is, the larger bandwidths simply don’t consider alternatives in these
directions. The larger bandwidths then start to be less powerful than the smaller
bandwidths.

These  observations  strike  at  the  heart  of  the  choice  of  truncation  point.  If 
one  chooses  M  too  large  (h  too  small)  then  power  decreases  because  one  is  adding 
noise  to  the  process.  If  one  chooses  M  too  small  ( h  too  large)  the  test  also  loses 
power  because  components  with  significant  efficacy  are  not  considered.  In  the 
worst  case  the  test  would  be  inconsistent  if  the  alternative  did  not  affect  the  first 
M  components  at  all.  It  is  believed  that  by  choosing  the  bandwidth  carefully 
in  the  initial  stage  these  extremes  can  be  avoided.  This  procedure  is  certainly 
preferable  to  the  alternative  of  using  a  standard  statistic.  In  that  case,  one  is 
assured  of  poor  performance  for  alternatives  stressing  higher  components. 
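The trade-off in the choice of M can be illustrated with a small simulation of the limiting normal components. This is a sketch under loud assumptions: an ordinary chi-square test on the sum of the first M squared components stands in for the subset chi-square test, which is not reproduced here, and the mean vector, seed, and replication counts are hypothetical choices.

```python
import random


def simulate_power(n_components, means, crit, reps, rng):
    """Fraction of replications whose sum of squares exceeds crit."""
    rejections = 0
    for _ in range(reps):
        stat = 0.0
        for j in range(n_components):
            mean = means[j] if j < len(means) else 0.0
            stat += rng.gauss(mean, 1.0) ** 2
        rejections += stat > crit
    return rejections / reps


def null_critical_value(n_components, alpha, reps, rng):
    """Monte Carlo estimate of the upper-alpha point under zero means."""
    draws = sorted(
        sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n_components))
        for _ in range(reps)
    )
    return draws[int((1.0 - alpha) * reps)]


rng = random.Random(12345)
means = [0.0, 2.5]   # hypothetical alternative: only the second component shifts
power = {}
for M in (1, 2, 8, 23):
    crit = null_critical_value(M, alpha=0.05, reps=4000, rng=rng)
    power[M] = simulate_power(M, means, crit, reps=4000, rng=rng)
print(power)
```

With M = 1 the component carrying the signal is excluded and the power stays near the size of the test; with M = 23 the signal is included but diluted by pure-noise components, so the power falls below that of M = 2.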

The effect of the bandwidth on $\varphi_h^2$ is less clear-cut. Figures 32 and 33
present power curves for $\varphi_h^2$ with h = 0.5, 0.3, and 0.1 for normal location and
scale alternatives, respectively. The ordering here is more complex. For the
location alternative, the order from most to least powerful is h = 0.5, 0.3, and
0.1; for the scale alternative, it is h = 0.3, 0.1, 0.5. The location alternative
is easier to explain: h = 0.5 is the most efficient first component (as will be
seen) and it gives the least weight to other components. For scale alternatives,
although h = 0.1 is the most efficient bandwidth, h = 0.3 is not much worse
(again, as shall be seen). However, h = 0.1 gives much greater weight to many
more components (recall Figure 14). This added variability causes $\varphi_{0.1}^2$ to be
Fig. 32. Power of $\varphi_h^2$, t-test, and first component (h = 0.5) against normal
location shifts. The x’s are $\varphi_{0.5}^2$; the solid line with dense blocks is $\varphi_{0.3}^2$; the +’s
are $\varphi_{0.1}^2$; the solid line is the t-test; the blocks are the first component (h = 0.5).


Fig. 33. Power of $\varphi_h^2$, F-test, and second component (h = 0.1) against normal
scale shifts. The x’s are $\varphi_{0.5}^2$; the solid line with dense blocks is $\varphi_{0.3}^2$; the +’s are
$\varphi_{0.1}^2$; the solid line is the F-test; the blocks are the second component (h = 0.1).


less powerful than $\varphi_{0.3}^2$.

Figures 32 and 33 each include two more power curves. Figure 32 also
gives the power curve for the t-test and the first component (h = 0.5). Figure
33 includes the power curves for the F-test and the second component (h =
0.1). These curves illustrate several statements made in Section 2. The first is
that  a  test  against  a  more  specific  alternative  hypothesis  will  tend  to  be  more 
powerful.  The  tests  of  the  first  and  second  components  do  better  than  any  of  the 
portmanteau  tests  for  testing  location  and  scale  shifts,  respectively.  Of  course, 
if  one  used  the  first  component  to  test  a  scale  alternative  one  would  find  it  did 
miserably.  This  fact  is  clearly  demonstrated  in  Table  8:  the  first  component 
has  efficacy  equal  to  0  for  both  the  scale  alternatives.  The  second  point  is  that 
asymptotically,  nonparametric  tests  can  do  just  as  well  as  parametric  tests.  The 
first  and  second  components  are  not  the  optimal  scores  for  shifts  in  the  normal 
distribution  (the  normal  scores  are).  Yet  they  do  very  well,  indeed. 

Tables 10 and 11 give the asymptotic relative efficiencies of the first two components
to standard rank statistics. Table 10 gives ARE’s of the first component
to the Wilcoxon, median, normal scores (location), and cosine tests. These are
all tests for location. Table 11 gives the ARE’s of the second component to
Mood, normal scores (scale), and cosine tests. These are all tests for scale. The
component is more efficient than the standard rank statistic if the entry exceeds
1. The score functions for all the rank tests but the cosine are given in Table
2. The score functions for the cosine rank tests are below their column title in
Tables 10 and 11. The purpose of including the cosine rank statistics will become
clear shortly.

From  Table  10  it  appears  that  the  standard  rank  statistics  are  more  efficient 
than  the  components  for  the  Cauchy  and  Laplace  distributions.  For  the  normal 
and  logistic  distributions,  the  best  component  is  on  a  par  with  the  best  standard 
test. 

It is apparent from Table 10 that the bandwidth does influence materially
the properties of the components. Recalling Figures 15 through 17, the score
functions for the first component do materially change with the bandwidth. It is
interesting to note that for these four cases a bandwidth of 0.5 is more efficient
than 0.1. There are, of course, cases where the ordering should change. Whatever
the character of the underlying distribution must be for the order to change, it
is not embodied in the examples here. These examples, however, are somewhat
restricted. Although they do cover a range of tail behavior, they are all unimodal
densities which are symmetric about 0.

Table 10

Asymptotic relative efficiencies of the components to standard rank
statistics for location alternatives.

                        Standard Rank Statistic
                                      Normal     Cosine
Bandwidth    Wilcoxon    Median       Scores     $\cos \pi u$

Normal
   0.5        1.016      1.518        0.970      1.074
   0.4        1.011      1.511        0.966      1.069
   0.3        0.996      1.489        0.952      1.054
   0.2        0.958      1.431        0.915      1.013
   0.1        0.817      1.221        0.780      0.864

Cauchy
   0.5        0.593      0.445        0.827      0.499
   0.4        0.491      0.368        0.685      0.413
   0.3        0.408      0.306        0.569      0.343
   0.2        0.318      0.239        0.444      0.268
   0.1        0.146      0.109        0.203      0.123

Logistic
   0.5        0.939      1.250        0.977      0.950
   0.4        0.905      1.204        0.942      0.916
   0.3        0.857      1.141        0.892      0.867
   0.2        0.781      1.039        0.812      0.790
   0.1        0.591      0.786        0.615      0.598

Laplace
   0.5        0.722      0.543        0.846      0.668
   0.4        0.692      0.521        0.811      0.640
   0.3        0.653      0.491        0.765      0.604
   0.2        0.574      0.432        0.673      0.531
   0.1        0.379      0.285        0.444      0.350

Table 11

Asymptotic relative efficiencies of the components to standard rank
statistics for scale alternatives.

                  Standard Rank Statistic
                          Normal     Cosine
Bandwidth    Mood         Scores     $\cos 2\pi u$

Normal
   0.5       1.009        0.783      1.345
   0.4       1.065        0.826      1.418
   0.3       1.134        0.880      1.511
   0.2       1.219        0.946      1.625
   0.1       1.312        1.019      1.749

Cauchy
   0.5       1.007        1.611      0.923
   0.4       0.940        1.504      0.862
   0.3       0.877        1.402      0.804
   0.2       0.794        1.270      0.728
   0.1       0.613        0.980      0.562

Logistic
   0.5       1.006        0.902      1.255
   0.4       1.042        0.935      1.300
   0.3       1.084        0.973      1.353
   0.2       1.125        1.010      1.405
   0.1       1.132        1.015      1.413

Laplace
   0.5       1.002        0.900      1.216
   0.4       1.029        0.924      1.249
   0.3       1.071        0.961      1.300
   0.2       1.115        1.001      1.354
   0.1       1.129        1.013      1.370

The  second  component  is  generally  more  competitive  against  standard  rank 
statistics  than  the  first  as  evidenced  in  Table  11.  There  is  also  considerably 
less  variation  across  bandwidths.  This,  too,  is  not  surprising  if  one  reflects  back 
to  Figures  15  through  17.  The  character  of  the  score  functions  of  the  second 
component  seems  to  change  less  than  that  of  the  first.  A  bandwidth  of  0.1  is 
optimal  in  each  case  except  the  Cauchy  in  which  case  a  bandwidth  of  0.5  is 
preferred. 

The  change  in  the  best  bandwidth  across  location  and  scale  alternatives 
for  the  same  distribution  is  disturbing.  This  means  that  one  cannot  choose 
the  bandwidth  to  best  protect  against  both  location  and  scale  shifts.  One  can, 
however,  select  the  bandwidth  so  that  both  components  do  nearly  as  well  as 
possible  for  both. 

The cosine based rank statistics have been included so that one can compare
the method of Section 3 to that of the components of $\phi^2$ advanced by
Eubank, LaRiccia, and Rosenstein (1987). Their method is based on using a
complete orthonormal basis as a set of score functions; for instance, the cosine
basis $\{\cos j\pi u\}$. They do not develop an estimator of the comparison density,
nor do they suggest a technique for testing the components. They do discuss
the components as testing successively higher frequency departures of the
comparison density from uniformity. They also point out that the components are
asymptotically iid N(0,1).

Filling in the obvious details not in their paper: if one didn’t have a truncation
point, M, in mind, one could choose it from the data. The estimate of the
comparison density is an ordinary orthogonal series estimate like those discussed
in Section 2.

An orthogonal series was not adopted for reasons given in Section 2. One
would  have  to  be  careful  in  the  choice  of  basis.  Orthogonal  series  methods  can 
impose  constraints  on  the  estimated  density.  The  advantages  of  components 
derived  from  the  eigenfunctions  were  discussed  in  Section  3. 

Admittedly,  the  methodology  based  on  the  boundary  kernel  is  more  complex, 
particularly  computationally,  but  the  burden  brings  with  it  the  convenience  of  use 
and unification. It has been said already that the goal is to make the statistician’s
job  difficult  so  that  the  researcher’s  job  is  easier. 

If  one  desired  to  compare  the  two  methods  from  a  testing  point  of  view,  it 
would  first  be  necessary  to  specify  the  test  to  be  applied  to  the  orthogonal  series 
components.  Any  method  of  Subsection  3.3  is  applicable.  However,  there  isn’t 
any  need  to  go  to  all  this  trouble.  The  result  one  would  be  certain  to  find  is  that 
which  is  more  powerful  depends  on  the  alternative  being  considered.  Since  both 
tests  are  based  on  the  components,  the  components  with  the  greatest  efficacies 
(and  hence  ARE’s)  should  yield  the  more  powerful  test.  Examining  the  ARE’s  of 
the  cosine  score  functions  in  Tables  10  and  11,  one  sees  that  for  some  alternatives 
such  as  the  Cauchy  location  and  scale,  the  cosine  scores  do  relatively  better.  For 
other alternatives, such as normal scale and location (h > 0.1), and logistic and
Laplace  scale  shifts  the  cosine  basis  does  relatively  worse.  Certainly,  an  overall 
test  would  reflect  these  observations. 

In  summary,  power  and  asymptotic  relative  efficiencies  were  examined  in 
this  subsection.  Power  curves  were  found  for  location  and  scale  alternatives  for 
the  normal  and  Cauchy  distributions.  Power  curves  were  also  found  for  two 
Fourier  alternatives.  These  curves  were  derived  for  the  subset  chi-square  test 
applied  to  the  components,  <p\,  CVM,  and  AD.  It  was  observed  that  as  the 
alternative  affected  higher  components,  these  last  three  statistics  performed  ever 
more  poorly.  It  was  also  observed  that  the  most  powerful  bandwidth  is  the  largest 
bandwidth  which  still  has  a  truncation  point  which  picks  up  the  components  most 
important  to  the  alternative  under  consideration. 


The ARE’s of the first two components to standard rank tests were found
for location and scale alternatives of four underlying distributions: the normal,
Cauchy, logistic and Laplace. The components generally performed better for
scale  alternatives  than  location  alternatives.  The  bandwidth  was  seen  to  affect 
the  performance  of  the  components.  It  was  also  seen  one  cannot  always  choose 
the  bandwidth  to  protect  best  against  both  location  and  scale  alternatives  for  a 
given  underlying  distribution. 

4.3.  Size  Studies 

4.3.1.  Introduction.  The  finite-sample  size  of  the  subset  chi-square  test  is 
investigated  in  this  subsection.  Simulations  are  run  to  determine  the  size  of  the 
test  using  both  the  small  sample  mean  and  the  asymptotic  mean  when  centering 
the  components.  The  sizes  using  the  small  sample  mean  are  seen  to  be  much 
better  than  those  found  using  the  asymptotic  mean.  The  sizes  for  the  small 
sample  mean  also  tend  to  be  below  their  nominal  value.  This  means  that  the 
test  is  conservative,  which  is  better  than  being  liberal.  The  estimated  sizes  using 
the  asymptotic  mean  tend  to  be  greater  than  the  stated  size. 

4.3.2.  Size  Study.  Simulations  are  used  to  estimate  the  small  sample  size 
of  the  subset  chi-square  test  applied  to  the  components.  They  are  conducted 
as follows. For each set of sample sizes, m and n, and for each bandwidth and
truncation point, R iterations are made. Within each iteration, two independent
random samples of sizes m and n are drawn from the U(0,1) distribution. The
boundary kernel estimate of the comparison density is found. From it the
components are calculated. Either the small sample or asymptotic mean is subtracted,
as appropriate. The subset chi-square test is then applied to the components.
For the jth iteration, $B_j$ is set to 1 if the test rejects and 0 if not. The size of
the test is estimated as

$$\hat{\alpha} = \frac{1}{R} \sum_{j=1}^{R} B_j.$$

There is no reuse of the data here as in the simulations to find power curves.
These simulations are to find point estimates, not estimates of functions. Reuse of
the data serves no purpose in this context. For simulations using the asymptotic
mean, R = 1000; for simulations using the small sample mean, R = 5000. A
greater  number  of  simulations  are  run  using  the  small  sample  mean  because  this 
is  the  case  of  greater  interest.  The  simulations  using  the  asymptotic  mean  are 
run  to  illustrate  the  improvements  that  result  from  using  the  small  sample  mean. 
All  subset  chi-square  tests  are  conducted  at  a  stated  size  of  0.05.  The  simulations 
are  run  for  h  =  0.5  (M  =  4),  h  =  0.3  (M  =  8),  and  h  =  0.2  (M  =  12).  They  are 
run  for  various  combinations  of  m  and  n  ranging  from  5  to  100. 
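The simulation loop described above can be sketched in a few lines. This is a minimal illustration only: the subset chi-square test and the boundary kernel components are not reproduced here, so a two-sided Wilcoxon rank-sum test (normal approximation) is used as a stand-in. The function names `wilcoxon_rejects` and `estimate_size`, the seed, and the choice m = n = 20 with 2000 replications are all assumptions made for the example.

```python
import math
import random


def wilcoxon_rejects(x, y, alpha=0.05):
    """Stand-in two-sample test (NOT the subset chi-square test):
    two-sided Wilcoxon rank-sum test via its normal approximation."""
    m, n = len(x), len(y)
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    # Rank sum of the first sample; continuous data, so ties are not expected.
    w = sum(rank for rank, (_, label) in enumerate(pooled, start=1) if label == 0)
    mean = m * (m + n + 1) / 2.0
    var = m * n * (m + n + 1) / 12.0
    z = (w - mean) / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    return p_value < alpha


def estimate_size(test, m, n, reps, rng):
    """Average the rejection indicators B_j over `reps` null replications."""
    rejections = 0
    for _ in range(reps):
        x = [rng.random() for _ in range(m)]  # U(0,1) sample of size m
        y = [rng.random() for _ in range(n)]  # independent U(0,1) sample of size n
        rejections += 1 if test(x, y) else 0  # indicator B_j
    return rejections / reps


rng = random.Random(1989)  # arbitrary seed, for reproducibility only
size_hat = estimate_size(wilcoxon_rejects, m=20, n=20, reps=2000, rng=rng)
print(f"estimated size: {size_hat:.3f}")
```

Any test with the same `(x, y) -> bool` interface could be passed to `estimate_size`, which is why the test is kept as an argument rather than hard-coded.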

Table 12 gives the results using the small sample means. The sizes look
remarkably good. The half width of a joint confidence interval to test
$H_0: \alpha_i = 0.05$ for $i = 1, \ldots, 30$ is

$$z_{0.975^{1/30}} \sqrt{\frac{0.05 \cdot 0.95}{5000}} \approx 0.010.$$

The  joint  test  is  rejected.  Ten  of  the  thirty  estimated  sizes  fall  outside  the 
confidence limits. These are n = m = 5 for h = 0.3, 0.2; n = 10, m = 5 for
h = 0.3, 0.2; n = 100, m = 5 for h = 0.5, 0.3, 0.2; n = m = 10 for h = 0.3, 0.2;
and n = 20, m = 10 for h = 0.2. These are the smallest sample sizes and the
smallest  bandwidths.  Only  one  size  corresponding  to  a  bandwidth  of  0.5  falls 
outside  the  confidence  limits. 
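The half width quoted above can be checked numerically. The sketch below assumes the quantile level is 0.975 raised to the power 1/30 (one reading of the joint interval for 30 estimated sizes), which gives a value close to the stated 0.010. The function name `joint_half_width` and its parameter names are illustrative, not taken from the source.

```python
from math import sqrt
from statistics import NormalDist


def joint_half_width(alpha, k, p0, reps):
    """Half width of a joint interval for k estimated sizes.

    Assumption: the normal quantile is taken at level (1 - alpha/2)**(1/k),
    and each estimate is a binomial proportion with success probability p0
    based on `reps` replications.
    """
    level = (1.0 - alpha / 2.0) ** (1.0 / k)
    z = NormalDist().inv_cdf(level)
    return z * sqrt(p0 * (1.0 - p0) / reps)


hw = joint_half_width(alpha=0.05, k=30, p0=0.05, reps=5000)
print(f"half width: {hw:.4f}")
print(f"limits: ({0.05 - hw:.3f}, {0.05 + hw:.3f})")
```

An estimated size is then flagged when it falls outside 0.05 plus or minus this half width.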

The  sizes  which  are  significantly  different  from  0.05  fall  into  one  of  two 
categories.  The  first  category  is  small  but  nearly  equal  sample  sizes  and  very 
small bandwidth. With 5 or 10 observations, nobody would use a bandwidth even as
small as 0.3. In this category are n = m = 5; n = 10, m = 5; n = m = 10;
and n = 20, m = 10. Hence, the fact that these sizes are significantly different
from  the  stated  size  is  not  that  much  of  an  issue.  The  second  category  is  very 
unequal  sample  sizes.  This  is  the  case  of  n  =  100  and  m  =  5.  If  this  sort  of 
case  were  to  arise,  one  would  need  to  be  aware  that  the  size  of  the  test  may 
be  smaller  than  stated.  However,  such  extreme  differences  in  sample  size  don’t 
usually  arise. 

Of the sizes found to be different from 0.05, 8 are below 0.05 and only 2
above. Of all the estimated sizes, 6 are above 0.05 and 24 below. The test seems
to err on the side of falling under 0.05. This is good as it means the test is
conservative. When a researcher runs a test at size 0.05, he has decided that he
will accept false rejections 5% of the time. It is better that when the test rejects,
the probability of a false rejection is less than this figure than above. Taken as
a whole, Table 12 presents very encouraging results.

Table 12

Estimated small sample sizes of the subset chi-square test applied to the
components centered by their small sample mean. The stated size of the
test is 0.05.

                     Number of Components
   m       n         4         8         12

   5       5       0.060     0.031     0.117
   5      10       0.046     0.035     0.023
   5     100       0.036     0.035     0.034
  10      10       0.049     0.039     0.024
  10      20       0.047     0.042     0.038
  10     100       0.042     0.046     0.040
  20      20       0.057     0.043     0.040
  20     100       0.052     0.049     0.050
  50      50       0.048     0.045     0.049
  50     100       0.049     0.052     0.047

  bandwidth        0.5       0.3       0.2

Table 13 presents the estimated sizes of the subset chi-square test applied to
the components which are centered by their asymptotic mean. The situation here
is much less satisfactory than that above. The half width of a joint confidence
interval for testing $H_0: \alpha_i = 0.05$ for $i = 1, \ldots, 33$ is

$$z_{0.975^{1/33}} \sqrt{\frac{0.05 \cdot 0.95}{1000}} \approx 0.022.$$

Again, the global test is rejected. Seven of the sizes fall outside their confidence
interval. These are: m = n = 10 for h = 0.3, 0.2; m = 20, n = 10 for h =
0.5, 0.3, 0.2; and m = 100, n = 10 for h = 0.3, 0.2. Table 13 has fewer significantly
different sizes than Table 12 for two reasons. The first is that Table 13 has no
sample sizes below 10. Seven of the ten significant sizes in Table 12 have a sample
size of 5. The second is that Table 13 is constructed with only 1000 replications
and so the estimates are less accurate.

Table 13

Estimated small sample sizes of the subset chi-square test applied to the
components centered by their asymptotic mean. The stated size of the
test is 0.05.

                     Number of Components
   m       n         4         8         12

  10      10       0.063     0.076     0.081
  10      20       0.051     0.057     0.048
  10     100       0.046     0.043     0.039
  20      10       0.073     0.084     0.073
  20      20       0.059     0.058     0.052
  30      30       0.059     0.050     0.057
  50      50       0.048     0.051     0.052
  50     100       0.052     0.059     0.053
 100      10       0.071     0.091     0.085
 100      50       0.046     0.052     0.048
 100     100       0.049     0.053     0.055

  bandwidth        0.5       0.3       0.2

There  are  two  disturbing  features  about  Table  13.  First,  it  is  quite  clear 
that  the  test  using  the  asymptotic  mean  is  not  invariant.  The  discussion  in 
Subsection 3.3.3 predicted that for m > n, one would reject too often. This
is  precisely  the  case  observed  here.  To  reach  a  different  conclusion  based  on 
which  sample  is  termed  the  first  is  extremely  undesirable.  The  second  disturbing 
feature  is  related  to  the  first.  All  7  of  the  significant  sizes  are  above  0.05;  taking 
the  table  as  a  whole,  24  of  the  33  (73%)  are  above  0.05.  If  the  procedure  must 
err,  it  is  preferable  for  the  size  to  be  below  what  is  stated,  not  above. 

There  are  two  conclusions  to  be  drawn  from  this  subsection.  The  first  is  that 
the  subset  chi-square  using  the  components  centered  by  their  small  sample  means 
performs  very  well  in  terms  of  keeping  its  stated  size  even  for  small  samples. 
This is true as long as a reasonable bandwidth is used and the sample sizes are
not  very  dissimilar.  When  it  does  err,  it  tends  to  err  on  the  side  of  having  a 
smaller  than  stated  size.  This  is  also  good.  The  second  is  that  subtracting  the 
small  sample  mean  instead  of  the  asymptotic  mean  is  a  good  idea.  The  test  is 
invariant  and  the  estimated  sizes  are  much  closer  to  the  stated  value. 
In summary of the section, power and size tests have been run. The subset
chi-square test applied to the components of the kernel density process was seen
to be a credible procedure. It is very competitive with the Cramér-von Mises
and  Anderson-Darling  statistics.  These  latter  two  statistics  were  seen  to  have 
very  low  power  against  alternatives  stressing  higher  order  components.  This 
deterioration  starts  with  the  second  component.  By  the  time  the  fourth  and 
higher  components  make  up  the  alternative,  these  statistics  do  very  poorly.  The 
subset  chi-square  test  did  not  exhibit  this  trait.  Instead,  the  choice  of  truncation 
point  plays  a  crucial  role  in  determining  its  power.  If  the  truncation  point  is 
chosen  too  large,  then  the  test  loses  power  because  the  signal  is  lost  in  the  noise. 
If  the  truncation  point  is  chosen  too  small,  then  the  test  loses  power  because  the 
signal  is  excluded  from  the  test.  A  careful  choice  of  bandwidth  should  reduce 
the  likelihood  of  experiencing  these  extremes. 
5. EXAMPLES AND APPLICATIONS


5.1.  Introduction 

The two sample procedure derived in Section 3 is applied to two data sets
in this section. The first data set is the observed weekly rate of return to two
savings and loan institutions over a 103 week period, while the second data set
is simulated. The two simulated samples are each normally distributed but with
different means and variances. This is an example of the well-known Behrens-Fisher
problem [see Kendall and Stuart (1979), pages 152 ff.]. The first data set is an
example  of  a  case  where  none  of  the  standard  statistics  identify  a  difference  in 
the  populations.  The  second  is  an  example  in  which  all  the  standard  statistics 
identify  a  difference  in  the  populations.  In  the  first  instance,  it  will  be  seen  how 
the  subset  chi-square  test  applied  to  the  components  can  find  differences  where 
the  others  fail.  In  the  second  instance,  it  will  be  seen  how  the  new  methodology 
can  clarify  a  distinction  also  found  to  exist  by  other  methods. 

5.2.  The  Savings  and  Loan  Data 

The  savings  and  loan  data  consist  of  the  weekly  rate  of  return  for  two  New 
York  Stock  Exchange  listed  savings  and  loans  over  a  103  week  period.  The 
observation  period  is  July  3,  1981  to  June  30,  1983.  The  first  sample  consists 
of  returns  for  H.F.  Ahmanson  and  Company;  the  second  consists  of  returns  for 
Financial  Corporation  of  Santa  Barbara.  The  data  are  drawn  from  Standard  and 
Poor’s Stock Price Data. The return for week t is defined as

$$r(t) = \log P(t) - \log P(t-1),$$

where P(t) is the price in week t. Dividends are added to the price in the week
they  are  paid.  This  definition  is  often  used  in  the  finance  literature;  see,  for 
example,  Fama  (1976)  pages  12-20. 
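As an illustration of the return calculation, the sketch below computes weekly returns as differences of log prices, which is the form suggested by the later remark about taking differences of the logarithms of the data. The price series is made up, the function name `weekly_log_returns` is hypothetical, and the way dividends enter (added to the current week's price only) is an assumption about the convention described above.

```python
import math


def weekly_log_returns(prices, dividends=None):
    """Return log(P(t) + dividend(t)) - log(P(t-1)) for t = 1, 2, ...

    Assumption: a dividend is added to the price only in the week it is
    paid, and the previous week's price is used unadjusted.
    """
    if dividends is None:
        dividends = [0.0] * len(prices)
    returns = []
    for t in range(1, len(prices)):
        adjusted = prices[t] + dividends[t]
        returns.append(math.log(adjusted) - math.log(prices[t - 1]))
    return returns


# Hypothetical prices, not taken from the savings and loan data.
prices = [21.0, 22.0, 21.0, 21.0]
returns = weekly_log_returns(prices)
# Scale by 100 in the same way as the tabulated values.
print([round(100 * r, 3) for r in returns])
```

A week with no price change gives a return of exactly 0, and a rise followed by an equal fall gives returns of equal magnitude and opposite sign.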

The  question  to  be  investigated  is  whether  the  returns  are  distributed  the 
same for the two institutions. One might suppose that the distribution of returns
differs by region, solvency, or in reaction to some outside event. Often one is
interested  in  whether  returns  differ  for  the  same  company  or  industry  in  two 
different  time  periods.  These  sorts  of  questions  are  not  infrequently  asked  in 
finance.  See,  for  example,  Dann  and  James  (1982)  or  Brown  and  Warner  (1980). 

The  returns  for  H.F.  Ahmanson  and  Company  are  given  in  Table  14.  The 
returns  for  Financial  Corporation  of  Santa  Barbara  are  given  in  Table  15.  The 
data sets are pictured in Figures 34 and 35. Graphed as time series, they have
much the same appearance. These data are assumed to be realizations of
independent sets of iid random variables. It is important to check the validity of these
assumptions. Since these data sets are observed economic time series, it would be
reasonable to expect them to be autocorrelated although this effect may well be
reduced by taking the differences of the logarithms of the data. Since each
savings and loan is subject to similar national economic and regulatory conditions,
it is also reasonable to expect the two series not to be independent.

Timeslab [Newton (1988)] is used to determine the correlation structure of
each series. In each case, the CAT [Parzen (1977)] criterion chose an
autoregressive order of zero. While not a guarantee of independence, the lack of correlation
is  very  good  news.  The  sample  cross-correlation  coefficient  between  the  two  series 
is  calculated  as  r  =  0.429.  This  value  is  significant  at  the  1%  level. 

A  positive  correlation  between  the  two  series  would  likely  reduce  the  chance 
of  rejecting  H0  if  it  were  true  since  the  two  series  would  appear  more  similar.  The 
extreme case would be if the correlation were 1. Breaking ties by midranking,
the ranks would be

$$R_i = \frac{2i - (1/2)}{N},$$

which are extremely uniform. Lower levels of correlation should still lead to more
uniform  ranks  than  would  otherwise  be  observed.  Hence,  tests  based  on  ranks 
are  expected  to  be  conservative  in  this  case. 
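The extreme case can be verified directly. The sketch below takes a small made-up sample, duplicates it to mimic a second sample with correlation 1 and the same marginal distribution, and computes midranks in the pooled sample. The helper name `midranks` and the data values are illustrative assumptions.

```python
def midranks(values):
    """Ranks 1..N of `values`, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next sorted value ties with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average = ((pos + 1) + (end + 1)) / 2.0
        for k in range(pos, end + 1):
            ranks[order[k]] = average
        pos = end + 1
    return ranks


x = [0.3, -1.2, 2.5, 0.9]  # hypothetical first sample
y = list(x)                # correlation 1 with equal marginals: y_i = x_i
N = len(x) + len(y)

pooled_ranks = midranks(x + y)
first_sample = sorted(pooled_ranks[:len(x)])  # midranks of the first sample
print(first_sample)                    # midranks 2i - 1/2, i = 1, ..., 4
print([r / N for r in first_sample])   # the same ranks divided by N
```

Each first-sample value ties with its copy in the second sample, so the ith smallest occupies pooled positions 2i - 1 and 2i and receives their average.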

This  analysis  is  borne  out  by  a  simulation  study.  Two  samples  of  size  100  are 
drawn  from  the  standard  normal  distribution.  Three  levels  of  cross-correlation 
are  used:  0.447,  0.707,  and  0.894.  As  the  savings  and  loan  data  appears  nearly 
normally  distributed,  this  choice  of  distribution  and  sample  sizes  should  yield 
Table 14

Weekly returns for H.F. Ahmanson and Company from July 3, 1981 to June 30,
1983. Table values are multiplied by 100.


-1.400 

0.000 

-6.000 

1.500 

-0.800 

-0.800 

3.000 

-0.700 

-0.800 

7.300 

-2.900 

-6.700 

-0.800 

0.800 

0.000 

-4.000 

-5.000 

0.800 

5.700 

3.900 

-1.500 

0.000 

-7.200 

-1.700 

1.700 

0.800 

-4.200 

-14.800 

-6.200 

6.200 

-8.300 

-11.500 

-2.500 

16.100 

-11.200 

1.700 

-1.700 

0.000 

2.400 

0.000 

2.300 

0.000 

-5.800 

7.000 

0.000 

-10.700 

-6.500 

-8.300 

2.900 

-4.300 

1.500 

4.300 

-2.800 

1.400 

5.500 

-2.700 

2.700 

3.900 

32.500 

-3.800 

9.200 

-2.700 

-0.900 

3.600 

7.600 

11.500 

1.400 

9.500 

3.200 

26.900 

2.800 

8.100 

2.600 

0.000 

0.000 

0.400 

0.400 

-7.800 

-3.700 

16.300 

-26.700 

9.400 

-1.900 

6.500 

-1.800 

16.800 

15.100 

-11.300 

1.500 

-6.500 

0.800 

8.200 

7.900 

2.000 

1.000 

0.300 

1.900 

-4.500 

-10.000 

1.800 

-1.100 

-2.200 

-7.300 

Table  15 

Weekly  returns  for  Financial  Corporation  of  Santa  Barbara  from  July  3,  1981  to 
June 30, 1983. Table values are multiplied by 100.

6.372 


2.740 

-11.955 

-5.219 

-1.802 

2.817 

0.000 

4.652 

-4.652 

0.000 

-2.985 

1.770 

-15.763 

-2.062 

1.361 

-4.027 


-2.740 

0.000 

-5.506 

-7.551 

-8.701 

0.000 

12.783 

9.097 

3.774 

5.884 

-3.572 

23.767 

2.062 

7.796 

-4.196 


5.407 

6.156 

-5.827 

-4.001 

-3.077 

-8.338 

7.686 

0.000 

31.508 

20.585 

-9.531 

-10.110 

15.123 

-11.935 

-4.380 


-6.156 

-4.082 

2.020 

-6.454 

19.671 

-3.774 

4.256 

5.264 

6.744 

-4.082 

0.000 

-5.407 

0.000 

2.941 


-2.500 

-6.560 

9.909 

-2.020 

-10.536 

0.000 

-16.705 

8.004 

-10.821 

-2.198 

6.063 

2.105 

3.637 

5.481 

4.256 


-6.538 

-5.219 

5.506 

-6.318 

-3.774 

-3.637 

-4.652 

3.774 

-8.961 

2.198 

-1.980 

0.000 

17.934 

6.454 


-4.139 

5.219 

0.000 

-27.329 

-3.922 

-25.131 

4.652 

-3.774 

6.063 

19.671 

-4.082 

2.062 

8.577 

-5.129 
Fig. 34. Observed weekly returns for H.F. Ahmanson and Company from July 3,
1981  to  June  30,  1983.  This  is  the  first  sample  in  the  analysis. 


Fig.  35.  Observed  weekly  returns  for  Financial  Corporation  of  Santa  Barbara 
from  July  3,  1981  to  June  30,  1983.  This  is  the  second  sample  in  the  analysis. 
Table 16

Estimated small sample sizes of the subset chi-square test applied to
components which are derived from correlated samples. The stated size
of the test is 0.05.

                 Number of Components
Correlation         6          9

   0.447          0.031      0.034
   0.707          0.020      0.020
   0.894          0.005      0.009

Bandwidth          0.3        0.2

insight  into  the  behavior  of  the  subset  chi-square  tests  in  this  situation.  The 
components  are  found  and  the  subset  chi-square  test  for  the  truncation  points 
M = 6, 9 corresponding to bandwidths of h = 0.3, 0.2 is applied to them. The
nominal  size  of  the  test  is  5%.  The  indicator  Bj  is  set  to  1  if  the  test  rejects  and 
0  if  not.  These  indicators  are  then  averaged  over  1000  replications  to  estimate 
the  size  of  the  test.  Table  16  gives  the  results.  In  each  case,  the  estimated  size  is 
less  than  5%.  The  higher  the  correlation,  the  lower  is  the  estimated  size.  Since 
the  test  is  conservative  in  this  case  and  the  reduction  in  its  size  for  the  level  of 
correlation  observed  in  the  data  is  not  extreme,  the  analysis  will  continue. 
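The data-generating step of this simulation study can be sketched as follows. The source does not say how the correlated normal samples were produced; the construction below (a linear combination of two independent standard normals) is one common way and is an assumption, as are the function names and the seed. The subset chi-square test itself is not reproduced.

```python
import math
import random


def correlated_normal_samples(size, rho, rng):
    """Draw `size` pairs (x_i, y_i) of standard normals with correlation rho.

    Assumed construction: y = rho * z1 + sqrt(1 - rho^2) * z2 with z1, z2
    independent standard normals and x = z1.
    """
    x, y = [], []
    for _ in range(size):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        x.append(z1)
        y.append(rho * z1 + math.sqrt(1.0 - rho * rho) * z2)
    return x, y


def sample_correlation(x, y):
    """Ordinary product-moment correlation between two equal-length lists."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(16)  # arbitrary seed
for rho in (0.447, 0.707, 0.894):  # the three levels used in the study
    x, y = correlated_normal_samples(100, rho, rng)
    print(f"target {rho:.3f}, sample correlation {sample_correlation(x, y):.3f}")
```

Each simulated pair of samples would then be passed to the test under study, and the rejection indicators averaged over the replications.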

Table  17  presents  the  summary  statistics  for  the  two  samples  and  the  pooled 
sample.  It  is  quite  difficult  to  distinguish  between  the  two  based  on  these.  Figures 
36  through  38  present  the  identification  quantile  plots  for  the  first,  second,  and 
pooled  samples,  respectively.  These  graphs  are  constructed  following  Parzen 
(1979)  and  described  briefly  here.  A  smoothed  version  of  the  sample  quantile 
function for the first sample is given by linearly interpolating the points

Q(u)  =  for  u  =  - — j  =  l,...,m, 

where $X_{(j)}$ is the jth order statistic. The identification quantile function, QI(u),
is defined as

$$QI(u) = (Q(u) - MQ)/DQ,$$

where MQ is the sample median and DQ = 2[Q(.75) - Q(.25)] is the sample value
of twice the interquartile range. The diagonal reference line in these figures is the
identification quantile function of the uniform distribution. Normally distributed
data will enter and exit near the corners of the box and have an inflection point
at u = 0.5. By subtracting MQ and dividing by DQ, plots of the identification
quantile function attempt to identify classes of distributions apart from location
and scale.

Table 17

Sample statistics for the savings and loan data.

                               First      Second     Pooled
Statistic                      Sample     Sample     Sample

Median                         0.0019    -0.0115    -0.0004
Twice Interquartile Range      0.1167     0.1971     0.1612
Maximum                        0.3254     0.3151     0.3254
Minimum                       -0.2667    -0.2733    -0.2733
Mean                           0.0057     0.0000     0.0029
Variance                       0.0060     0.0079     0.0069
Standard Deviation             0.0773     0.0891     0.0832

Fig. 36. The identification quantile function of weekly returns for H.F. Ahmanson
and Company.

Fig. 37. The identification quantile function of weekly returns for Financial
Corporation of Santa Barbara.

Fig. 38. The identification quantile function of a pooling of weekly returns for
H.F. Ahmanson and Company and Financial Corporation of Santa Barbara.
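A minimal sketch of the identification quantile function follows. The plotting positions at which the order statistics are placed are not legible in the text above, so the code assumes u_j = (2j - 1)/(2m); this choice, the function names, and the test data are assumptions for illustration only.

```python
def sample_quantile(sorted_data, u):
    """Piecewise-linear sample quantile function.

    Assumption: the jth order statistic is placed at u_j = (2j - 1) / (2m),
    with the function held constant outside [u_1, u_m].
    """
    m = len(sorted_data)
    pos = u * m - 0.5  # fractional 0-based index; u_j maps to j - 1
    if pos <= 0:
        return sorted_data[0]
    if pos >= m - 1:
        return sorted_data[-1]
    lower = int(pos)
    frac = pos - lower
    return sorted_data[lower] + frac * (sorted_data[lower + 1] - sorted_data[lower])


def identification_quantile(data, u):
    """QI(u) = (Q(u) - MQ) / DQ, with MQ the median and DQ twice the IQR."""
    s = sorted(data)
    mq = sample_quantile(s, 0.5)
    dq = 2.0 * (sample_quantile(s, 0.75) - sample_quantile(s, 0.25))
    return (sample_quantile(s, u) - mq) / dq


# Evenly spaced data behave like a uniform sample, so QI(u) is close to u - 0.5.
data = [float(i) for i in range(1, 101)]
for u in (0.1, 0.25, 0.5, 0.75, 0.9):
    print(u, round(identification_quantile(data, u), 4))
```

By construction QI(0.5) = 0 and QI(0.75) - QI(0.25) = 0.5 for any sample, which is what removes location and scale from the plot.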

Examining  these  figures,  the  data  appear  to  have  slightly  longer  than  normal 
tails  since  they  exit  the  boxes  short  of  the  corners.  The  three  graphs  appear  quite 
similar.  The  identification  quantile  function  of  the  first  sample  seems  to  follow 
the  uniform  reference  line  for  a  greater  distance  than  the  second  sample.  The 
fact  that  the  identification  quantile  function  of  the  pooled  sample  falls  below  the 
uniform  reference  line  on  about  the  range  u  =  0.25  to  u  =  0.5  where  the  others 
don’t  may  indicate  differences.  However,  there  are  difficulties  in  comparing  plots 
such  as  these.  It  is  hard  to  tell  if  the  differences  are  really  there  or  are  due  to 
random  variation. 

Figure 39 presents an overlay of the identification quantile functions for each
sample. Here the differences for the left tail (u < 0.5) are brought into sharper
focus. Still, the question remains whether this difference is really there or is due
to random variation. However, the purpose of these plots is to draw attention to
such possibilities.

Fig. 39. An overlay of the identification quantile functions of weekly returns for
H.F. Ahmanson and Company (solid with blocks) and Financial Corporation of
Santa Barbara (solid line).

Figure 40 presents another type of plot often used to compare two
populations. This is a QQ plot. Each box in the figure represents a pair
$(X_{(j)}, Y_{(j)})$ of corresponding order statistics from the two samples.
The graph should be approximately linear with slope near 1 if the two
populations are the same. The function appears quite linear with slope approximately
equal  to  1  except  in  the  tail.  Again,  the  question  is  how  far  away  the  pictured 
function  must  be  from  the  ideal  for  the  two  populations  to  be  declared  different. 
Some  work  has  been  done  in  this  area;  see,  for  example,  Aly  and  Bleuer  (1986). 
Their  work  will  not  be  pursued  here. 

Figure 41 gives the sample comparison distribution function, $D_N(u)$, for the
data  and  a  diagonal  line  for  reference.  Immediately  apparent  is  the  jump  at 
u  =  0.5.  Although  less  apparent  from  the  other  graphs,  it  can  be  seen  in  Tables 
14  and  15  that  the  data  has  repeats  at  a  value  of  0.  There  are  some  weeks  that 
the  stock  price  doesn’t  change.  Repeat  values  violate  the  assumption  that  the 
distribution functions F and G are continuous. The analysis seems quite robust
to this departure. The complete analysis has been repeated two different ways
without  a  major  change  in  results.  The  first  was  to  add  normal  white  noise  with 
a  very  small  variance  to  each  series.  This  randomly  breaks  the  ties.  The  results 
did  not  change.  The  second  way  was  to  conduct  the  analysis  conditioned  on  the 
return  being  non-zero.  Again,  the  analysis  did  not  materially  change.  In  the 
analysis  conducted  in  this  subsection,  ties  are  resolved  by  midranking. 
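Midranking assigns every observation in a group of tied values the average of the ranks that the group occupies. The following is a minimal sketch of that rule; the function name `midranks` and the example returns are hypothetical and are not taken from Tables 14 and 15.

```python
def midranks(values):
    """Return the rank of each value; tied values share the average of their ranks."""
    # Indices of the observations, ordered from smallest to largest value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the block [start, end) over all observations tied with order[start].
        end = start + 1
        while end < len(order) and values[order[end]] == values[order[start]]:
            end += 1
        # Ranks are 1-based, so the block occupies ranks start + 1 through end.
        average_rank = (start + 1 + end) / 2
        for position in range(start, end):
            ranks[order[position]] = average_rank
        start = end
    return ranks


# Hypothetical weekly returns with three weeks of no price change.
example_returns = [0.0, -0.013, 0.0, 0.021, 0.0]
example_ranks = midranks(example_returns)
print(example_ranks)
```

The three zero returns occupy ranks 2, 3, and 4, so each receives the midrank 3.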

Returning to Figure 41, one sees some departure from uniformity but not
enough for a clear rejection of H0. More informative is Figure 42, which presents
$\sqrt{N\lambda_{(N)}/(1-\lambda_{(N)})}\,[D_N(u) - u]$. Under H0, this function converges weakly to
a Brownian bridge process. The maximum absolute deviation of the function
pictured in Figure 42 is not quite large enough for a Kolmogorov-Smirnov test
to reject H0. However, from the overall appearance of the graph, one might
question whether there are enough zero crossings for this to be a sample path
of a Brownian bridge process. The sample path seems somewhat deterministic.
It is below 0 for u < 0.5 and mostly above 0 for u > 0.5. Table 18 presents
the observed and critical values of the Cramér-von Mises, Anderson-Darling and
Kolmogorov-Smirnov statistics. None rejects H0 at the 5% level.

Figure 43 presents diagnostics for the choice of bandwidth. The function
pictured is given by linearly interpolating the points

(5.2.1)  $\sqrt{\frac{N\lambda_{(N)}}{1-\lambda_{(N)}}}\left[\hat D_h\left(\frac{j}{m+1}\right) - \frac{j}{m+1}\right]$, for $j = 1, \ldots, m$,

where $\hat D_h(w) = \int_0^w \hat d_h(u)\,du$. If $\hat D_h(w)$ were the true comparison distribution
function, the graphs should appear as a Brownian bridge process. Recall from
Subsection 3.3.7 that the goal is to undersmooth the data slightly. Undersmoothing
$\hat d_h(u)$ causes the deviations from 0 of the process defined in (5.2.1) to be too
small. Referring to Figure 43, a bandwidth of 0.1 [Figure 43(a)] is clearly
undersmoothing. Figure 43(b) (h = 0.2) is also undersmoothed, but not as much:
its deviations from 0 are larger. Figure 43(c) seems about the right amount of
smoothing and Figure 43(d) appears to oversmooth. A bandwidth of h = 0.2 is
chosen.

Figure 44 gives the corresponding estimates of the comparison density function

Fig. 42. The sample null empirical comparison distribution process for the savings
and loan return data. The function pictured is $\sqrt{N}[D_N(u) - u]$.


Table 18

Two sample portmanteau statistics for the savings and loan data.

Statistic             Observed Value   5% Critical Value   1% Critical Value
Cramér-von Mises           0.288             0.460               0.740
Anderson-Darling           1.420             2.490               3.850
Kolmogorov-Smirnov         1.184             1.360               1.640
$\varphi_{0.2}$            9.304             8.931              12.400

Fig. 43. Sample paths of $\sqrt{N}[\hat D_h(u) - u]$ for the savings and loan return data.
Figure (a) pictures the process for h = 0.1; Figure (b) for h = 0.2; Figure (c) for
h = 0.3; and Figure (d) for h = 0.4.

Fig. 45. The criterion function, C(k), for the savings and loan data. Also pictured
are the criterion values for the next best subset.


Table 19

The first nine squared components of the
kernel density process (h = 0.2) for the
savings and loan data.

Component Number   Squared Value
       1               0.322
       2               2.538
       3               0.923
       4               8.269
       5               0.719
       6               3.198
       7               0.016
       8               3.669
       9               1.344

Fig. 46. The boundary kernel estimate (h = 0.2) and the orthogonal series
estimate (components 4, 8, 6, and 2) of the comparison density function for the
savings and loan return data. The blocks are the boundary kernel estimate and
the solid line is the orthogonal series estimate.


The two have very much the same character. To interpret Figure 46 one starts
by observing that $f(Q_H(u)) > g(Q_H(u))$ for u values between about u = 0.3 and
u = 0.7. The sample pooled quantile function, $Q_H(u)$, appears largely symmetric
about 0 (recall Figure 38). This implies that f > g for values closer to 0 and that
f < g in the tails away from 0. This is not merely an artifact of the repeated
values at 0. The first sample has 9 repeats at 0; the second sample has 10.

From Figure 45 there appear to be two other models which should be examined.
One has three components: 4, 8, and 6. The other has five components:
4, 8, 6, 2, and 9. Figure 47 presents the other two orthogonal series estimates.
Figure 47(a) pictures the model with three components and Figure 47(b) pictures
the model with five components. The former pictures the comparison density as
rising back above 1 near both endpoints. The latter agrees substantially with
the estimates pictured in Figure 46.
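The candidate models of sizes 3, 4, and 5 are nested because, for a fixed size k, the subset with the largest sum of squares is simply the k largest squared components. The sketch below ranks the values of Table 19 accordingly. It covers only this ranking step: the criterion function C(k) and its simulated critical values are not reproduced, and the helper name `best_subsets` is an illustrative assumption.

```python
# Squared components of the kernel density process, transcribed from Table 19.
squared_components = {
    1: 0.322, 2: 2.538, 3: 0.923, 4: 8.269, 5: 0.719,
    6: 3.198, 7: 0.016, 8: 3.669, 9: 1.344,
}


def best_subsets(squared):
    """For each size k, return (k, components with the largest sum of squares, that sum)."""
    ranked = sorted(squared, key=squared.get, reverse=True)
    subsets = []
    running_sum = 0.0
    for k, component in enumerate(ranked, start=1):
        running_sum += squared[component]
        subsets.append((k, ranked[:k], round(running_sum, 3)))
    return subsets


for size, members, total in best_subsets(squared_components):
    print(size, members, total)
```

The first five components in this ranking are 4, 8, 6, 2, and 9, which matches the three models discussed in the text.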

The Cramér-von Mises and Anderson-Darling statistics failed to reject H0
because the two samples have about the same location and scale. The $\varphi_{0.2}$
statistic detected a difference because it gives much greater weight to the fourth
component than the other two statistics. The ratio of the fourth eigenvalue to
the first for $\varphi_{0.2}$ is 0.49; for the Cramér-von Mises it is 0.0625; and for the
Anderson-Darling it is 0.10.

Fig. 47. Alternate orthogonal series estimates of the comparison density function
for the savings and loan return data. Figure (a) is based on a subset of size 3
(components 4, 8, and 6) and Figure (b) is based on a subset of size 5 (components
4, 8, 6, 2 and 9).
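The two ratios quoted for the standard statistics follow from their well-known eigenvalues: 1/(j²π²) for the Cramér-von Mises statistic and 1/(j(j+1)) for the Anderson-Darling statistic. The short sketch below reproduces them. The eigenvalues of the $\varphi_{0.2}$ statistic depend on the kernel and bandwidth and are not reproduced here; the function names are illustrative.

```python
import math


def cramer_von_mises_weight(j):
    """Weight on the j-th squared component in the Cramér-von Mises limit law."""
    return 1.0 / (j * math.pi) ** 2


def anderson_darling_weight(j):
    """Weight on the j-th squared component in the Anderson-Darling limit law."""
    return 1.0 / (j * (j + 1))


cvm_ratio = cramer_von_mises_weight(4) / cramer_von_mises_weight(1)
ad_ratio = anderson_darling_weight(4) / anderson_darling_weight(1)
print(round(cvm_ratio, 4), round(ad_ratio, 4))
```

Both standard statistics give the fourth component at most one tenth of the weight of the first, which is consistent with their failure to detect an alternative dominated by the fourth component.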

The new methods suggested by this work were able to detect differences
in the data that the standard portmanteau statistics (Kolmogorov-Smirnov,
Cramér-von Mises, Anderson-Darling) could not. The procedure not only found
differences, but was able to estimate the actual relation between the populations
in a meaningful way. Interestingly, the $\varphi_{0.2}$ statistic suggested by this work also
detected a difference in the two populations.

5.3.  The  Behrens-Fisher  Data 

A simulated data set exhibiting the Behrens-Fisher problem is analyzed in
this subsection. The Behrens-Fisher problem is to distinguish between two normal
populations which differ in their mean and variance. The first sample is a
random sample of size 30 drawn from the N(0,1) distribution. The second sample
is a random sample of size 30 drawn from the N(1,3) distribution. The tests used
in this example will clearly indicate that the null hypothesis is false. The value
of such an example is to demonstrate the extra information that can be obtained
from the methods presented in this work.

The  data  are  listed  in  Table  20  and  presented  graphically  as  Figure  48.  Even 
from  the  figure  it  appears  that  the  two  samples  are  different.  The  second  sample 
[Figure  48(b)]  appears  to  be  more  variable  if  not  possessing  a  greater  mean. 
Table  21  gives  the  sample  statistics  for  the  two  samples.  These  statistics  also 
indicate  likely  differences  in  the  two. 

Figures  49  through  51  present  the  identification  quantile  plots  for  the  first, 
second,  and  pooled  samples,  respectively.  One  would  certainly  not  detect  a 
difference  based  on  these  plots.  Given  the  origin  of  the  data,  one  would  not 
expect  to.  The  two  populations  are  the  same  up  to  location  and  scale.  These 
graphs  remove  the  effect  of  location  and  scale.  The  pooled  sample  is  bimodal, 
but  this  is  not  clear  from  the  identification  quantile  function.  Bimodality  often 
causes  the  identification  quantile  to  appear  short  tailed.  The  pooled  quantile  in 
Figure  51  doesn’t  appear  to  be  short  tailed.  Figure  52  presents  an  overlay  of  the 
identification  quantile  plots  for  the  two  samples.  The  two  appear  very  similar 
indeed. 

Figure  53  presents  the  QQ  plot  of  the  two  samples.  The  graph  is  somewhat 
deceiving  because  the  horizontal  and  vertical  axes  are  not  in  the  same  scale.  The 
values  above  (0,0)  do  deviate  substantially  from  the  diagonal.  The  deviations 
below  (0,0)  are  less  severe.  From  this  figure,  one  would  strongly  suspect  that 
these  two  data  sets  are  not  from  the  same  populations. 

Table 22 gives the Cramér-von Mises, Anderson-Darling, Kolmogorov-Smirnov,
and $\varphi_{0.3}$ statistics. Each rejects at the 5% level. The Kolmogorov-Smirnov
also rejects at the 1% level. That H0 should be rejected is also clear
from Figures 54 and 55. Figure 54 presents $D_N(u)$. It never falls below the
reference diagonal line and the deviation from the diagonal is substantial. The
process $\sqrt{N}[D_N(u) - u]$ drives the point home in Figure 55. The process certainly
does not have the character of a Brownian bridge process.
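The Kolmogorov-Smirnov value in Table 22 can be recovered from the sample comparison distribution function. The following sketch evaluates $D_N(u)$ at u = j/N as the fraction of the first sample among the j smallest pooled observations, then takes the largest scaled deviation from the diagonal. The data are transcribed from Table 20. The assignment of values to the two samples is inferred from the flattened table, and it is an assumption checked only against the sample means reported in Table 21. The function names are illustrative.

```python
import math

# Data transcribed from Table 20 (30 observations per sample; assignment inferred).
first_sample = [
    0.243, 0.258, 1.082, -0.897, -0.713, 1.486, -1.180, -3.147, 0.722, 1.108,
    2.048, 0.764, 0.062, 1.800, 0.684, 0.462, -1.031, -1.560, -2.100, 1.100,
    -0.250, -0.272, -0.432, 0.117, 0.047, -1.221, -0.800, -0.370, 0.585, 0.669,
]
second_sample = [
    2.858, 2.827, -0.260, 1.202, -0.290, -1.202, -1.280, 1.752, 0.183, -0.731,
    -1.859, 5.421, 0.376, 1.283, 1.876, 0.445, 3.599, 1.796, -1.752, -0.354,
    3.455, 2.681, -2.880, 1.599, -0.166, -0.796, 2.270, 2.387, 2.278, 1.524,
]


def comparison_distribution(first, second):
    """Return [(u, D_N(u))] at u = j/N, assuming no ties between observations."""
    pooled = sorted([(x, 1) for x in first] + [(y, 0) for y in second])
    m, total = len(first), len(first) + len(second)
    points, count_first = [(0.0, 0.0)], 0
    for j, (_, from_first) in enumerate(pooled, start=1):
        count_first += from_first
        points.append((j / total, count_first / m))
    return points


def kolmogorov_smirnov(first, second):
    """Largest absolute value of sqrt(N*lam/(1-lam)) * [D_N(u) - u] over u = j/N."""
    total = len(first) + len(second)
    lam = len(first) / total  # fraction of the pooled sample in the first sample
    scale = math.sqrt(total * lam / (1.0 - lam))
    return scale * max(abs(d - u) for u, d in comparison_distribution(first, second))


ks_statistic = kolmogorov_smirnov(first_sample, second_sample)
print(round(ks_statistic, 3))
```

With equal sample sizes the scale factor reduces to the square root of N, and the result agrees with the observed Kolmogorov-Smirnov value of 1.678 in Table 22.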

Figure 56 pictures the process $\sqrt{N}[\hat D_h(u) - u]$. The bandwidth is selected
in the same manner as before. Here a bandwidth of h = 0.3 is appropriate. A
bandwidth of h = 0.2 might also be used. Figure 57 presents the corresponding

Table 20

White noise data exhibiting the Behrens-Fisher problem.

First Sample

 0.243   0.258   1.082  -0.897  -0.713   1.486
-1.180  -3.147   0.722   1.108   2.048   0.764
 0.062   1.800   0.684   0.462  -1.031  -1.560
-2.100   1.100  -0.250  -0.272  -0.432   0.117
 0.047  -1.221  -0.800  -0.370   0.585   0.669

Second Sample

 2.858   2.827  -0.260   1.202  -0.290  -1.202
-1.280   1.752   0.183  -0.731  -1.859   5.421
 0.376   1.283   1.876   0.445   3.599   1.796
-1.752  -0.354   3.455   2.681  -2.880   1.599
-0.166  -0.796   2.270   2.387   2.278   1.524

Fig. 48. The data for the Behrens-Fisher problem. Figure (a) is the first sample
and is a realization of a random sample from the N(0,1) distribution. Figure (b)
is the second sample and is a realization of a random sample from the N(1,3)
distribution.

Table 21

Sample statistics for the Behrens-Fisher problem data.

Statistic                    First Sample   Second Sample   Pooled Sample
Median                           0.090          1.243           0.317
Twice Interquartile Range        3.044          5.265           4.567
Maximum                          2.048          5.421           5.421
Minimum                         -3.147         -2.880          -3.147
Mean                            -0.025          0.941           0.458
Variance                         1.347          3.607           2.672
Standard Deviation               1.161          1.899           1.635

Fig. 49. The identification quantile function of the first sample of the
Behrens-Fisher problem.

Fig. 50. The identification quantile function of the second sample of the
Behrens-Fisher problem.

Fig. 51. The identification quantile function of the pooled sample of the
Behrens-Fisher problem.

Fig. 52. An overlay of the identification quantile functions of the two samples
of the Behrens-Fisher problem. The solid line with blocks is the first sample and
the solid line is the second sample.

Fig. 53. A QQ plot of the two samples of the Behrens-Fisher problem. The first
sample is the horizontal axis and the second sample is the vertical axis.

Table 22

Two sample portmanteau statistics for the Behrens-Fisher problem data.

Statistic             Observed Value   5% Critical Value   1% Critical Value
Cramér-von Mises           0.623             0.460               0.740
Anderson-Darling           3.532             2.490               3.850
Kolmogorov-Smirnov         1.678             1.360               1.640
$\varphi_{0.3}$           10.118             7.097              10.433

Fig. 54. The sample comparison distribution function for the Behrens-Fisher
problem.

Fig. 55. The sample null empirical comparison distribution process for the
Behrens-Fisher problem.

Fig. 56. Sample paths of $\sqrt{N}[\hat D_h(u) - u]$ for the Behrens-Fisher problem. Figure
(a) is constructed with h = 0.1; Figure (b) with h = 0.2; Figure (c) with h = 0.3;
and Figure (d) with h = 0.4.

Fig. 57. Boundary kernel estimates of the two sample comparison density function
for the Behrens-Fisher problem. Figure (a) is constructed with h = 0.1;
Figure (b) with h = 0.2; Figure (c) with h = 0.3; and Figure (d) with h = 0.4.

comparison density estimates for these bandwidths. A truncation point of M = 6
is used with the bandwidth of h = 0.3. This truncation point includes all
eigenvalues above 0.01.

Table 23 gives the squares of the components of the kernel density process
for h = 0.3. The criterion function C(k) for a size of 5% is given in Figure 58.
The null hypothesis is rejected. A subset of size 2 is selected. The components in
the most significant subset are 2 and 1. A subset of size 3 containing components
2, 1, and 5 might also be considered. The criterion function is negative for each
subset which yields the second largest value of C(k). The value of $\chi^2_1(0.95^{1/6})$
is 6.922. Examining the squares of components in Table 23 one finds that the
independent tests method would fail to reject H0, since the largest squared
component, 6.840, falls below this value.
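The critical value for the independent tests method can be checked numerically. With M = 6 independent components and an overall size of 5%, each squared component is compared with the 0.95^(1/6) quantile of the chi-square distribution with one degree of freedom. The sketch below obtains that quantile from the standard normal quantile function and applies it to the values of Table 23. The helper name `chi_square_1_quantile` is an illustrative assumption.

```python
from statistics import NormalDist


def chi_square_1_quantile(p):
    """Quantile of chi-square with 1 df, using P(Z^2 <= q) = 2*Phi(sqrt(q)) - 1."""
    z = NormalDist().inv_cdf((1.0 + p) / 2.0)
    return z * z


# Squared components of the kernel density process, transcribed from Table 23.
squared_components = [4.533, 6.840, 0.018, 0.614, 1.696, 0.010]

overall_level = 0.05
# Probability each individual test must cover so that all six jointly cover 95%.
per_test_probability = (1.0 - overall_level) ** (1.0 / len(squared_components))
critical_value = chi_square_1_quantile(per_test_probability)

# The independent tests method rejects if any single squared component is too large.
rejects = any(value > critical_value for value in squared_components)
print(round(critical_value, 3), rejects)
```

The computed critical value is close to the 6.922 quoted above, and no single squared component exceeds it, so this method does not reject.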

The boundary kernel and orthogonal series estimates of the comparison density
function are presented in Figure 59. The orthogonal series estimate is somewhat
smoother than the boundary kernel estimate. From the estimate, it appears
the two samples differ mainly in scale. However, the first component is undeniably
large. It is important that such numeric diagnostics accompany graphs to
help direct the eye to important features.

Table 23

The first six squared components of the
kernel density process (h = 0.3) for the
Behrens-Fisher problem data.

Component Number   Squared Value
       1               4.533
       2               6.840
       3               0.018
       4               0.614
       5               1.696
       6               0.010

Fig. 58. The criterion function, C(k), for the Behrens-Fisher problem. Also
pictured are the criterion values for the next best subset.

Fig. 59. The boundary kernel estimate (h = 0.3) and orthogonal series estimate
(components 2 and 1), of the comparison density function for the Behrens-Fisher
problem. The line with blocks is the boundary kernel estimate and the solid line
is the orthogonal series estimate.

In this case it is possible to compare the estimated densities with the true
comparison density. Figure 60 presents the true comparison density and the
orthogonal series estimate. The estimate is excellent considering it was derived
from two samples of size 30. In terms of estimating the region where f > g,
the estimate misses on the left on an interval of length about 0.05 and on the
right on an interval of length of only about 0.03. The square of the $L_2$ distance
between the estimated and true comparison density functions is 0.024 for the
boundary kernel estimate and 0.018 for the orthogonal series estimate.
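The true comparison density for this example can be evaluated numerically as the ratio of the first density to the pooled density, taken at the pooled quantile. The following is a minimal sketch under stated assumptions: N(1,3) is read as mean 1 and variance 3 (consistent with the sample variance in Table 21), the pooled distribution weights the two populations equally because the sample sizes are equal, and the pooled quantile is found by bisection. All names are illustrative.

```python
from statistics import NormalDist

first = NormalDist(mu=0.0, sigma=1.0)
second = NormalDist(mu=1.0, sigma=3.0 ** 0.5)  # assumption: N(1,3) means variance 3
lam = 0.5  # fraction of the pooled sample in the first sample (30 of 60)


def pooled_cdf(x):
    return lam * first.cdf(x) + (1.0 - lam) * second.cdf(x)


def pooled_quantile(u, low=-50.0, high=50.0):
    """Invert the pooled distribution function by bisection."""
    for _ in range(100):
        mid = 0.5 * (low + high)
        if pooled_cdf(mid) < u:
            low = mid
        else:
            high = mid
    return 0.5 * (low + high)


def comparison_density(u):
    """Ratio of the first density to the pooled density at the pooled quantile."""
    x = pooled_quantile(u)
    pooled_pdf = lam * first.pdf(x) + (1.0 - lam) * second.pdf(x)
    return first.pdf(x) / pooled_pdf


# Midpoint grid on (0, 1); the density should integrate to approximately 1.
grid = [(i + 0.5) / 200 for i in range(200)]
density = [comparison_density(u) for u in grid]
integral = sum(density) / len(grid)

# Region where the comparison density exceeds 1, which is where f > g.
region = [u for u, d in zip(grid, density) if d > 1.0]
print(round(integral, 4), region[0], region[-1])
```

The endpoints of `region` give a grid approximation to the interval on which f > g, which is the interval against which the estimate is judged above.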

Each method (except the independent tests method applied to the components)
rejected H0. One can now judge the relative merit of each. The
Kolmogorov-Smirnov, Cramér-von Mises, Anderson-Darling and $\varphi_{0.3}$ statistics
give no indication of why they reject, only that they do. In combination with
the identification quantile functions and sample statistics (MQ, DQ), one could
discern that the two samples differ by location and scale factors. If the two
differed by higher order components, these relationships would be much harder to
identify in this manner. The estimate of the comparison density coupled with
the components as diagnostics is equally applicable to alternatives affecting
high and low order components.

Fig. 60. The orthogonal series estimate (components 2 and 1) and the actual
comparison density function of the Behrens-Fisher problem. The line with blocks
is the actual value and the solid line is the orthogonal series estimate.

The procedure has now been applied to two data sets. The first consisted
of observed returns for two savings and loan institutions; the second was data
simulated to exhibit the Behrens-Fisher problem. The first case exemplified an
alternative which affects principally higher order components. The
Kolmogorov-Smirnov, Cramér-von Mises, and Anderson-Darling tests all failed to detect a
difference in the populations. The subset chi-square test applied to the components
rejected H0. The estimate of the comparison density gave an excellent
graphical presentation of the relation of the two densities. The second data set
exemplified an alternative for which every method rejects. Yet even here the
estimate of the comparison density along with the components as diagnostics
presented the relation of the populations in a clear and concise manner.

6. CONCLUSIONS


6.1.  Conclusions 

This work has sought to expand and refine the traditional analysis of two
samples. The commonly used techniques were conceived in an era when computing
facilities were a true constraint on what it was possible to do. Computing
facilities no longer pose a constraint. Indeed, modern desktop personal computers
and workstations are largely wasted on many traditional statistical methods.
The analysis of two samples is one such area. A basic goal of this work has
been to find a procedure more suited to the graphical and interactive computer
environments now available.

A  good  deal  more  was  sought  in  this  work  than  just  “gee  whiz”  type  graphics 
and  some  number  crunching.  The  desire  has  been  to  find  a  mode  of  analysis  which 
provides  a  deeper  understanding  of  the  relation  of  the  two  populations  under 
study.  The  philosophy  has  been  that  such  a  deeper  understanding  is  possible  now 
that  numerically  intensive  methods  are  not  ruled  out  and  high  quality  graphics 
in  real  time  are  available. 

Several desirable features that a procedure should possess were defined. It
was desired to make minimal assumptions about the distribution functions of the
two populations. A portmanteau test was desired to avoid specifying too closely
the relation of the two distribution functions under alternatives. Similarly, a
nonparametric test was required to avoid assuming a parametric family for the
two distribution functions. Finally, it was desired to estimate the relation of F
to G when H0 is rejected. Most existing two sample techniques fail to enlighten
one in this regard.

Upon reviewing existing methodologies, it was seen that the comparison density
is an important object with regard to several of these goals. The comparison
density is uniform if and only if H0 is true. It is interpretable as the likelihood
ratio of the density of the first sample to that of the pooled sample. An estimate
of this density proves useful in two ways. First, it can be tested for uniformity
as a means of testing H0. Second, it can serve as an estimate of the relation of
the densities of the two samples.

Parzen (1983) introduced a natural estimate of the comparison distribution
function. The comparison distribution function is simply the distribution function
associated with the comparison density function. This estimator is called
the sample comparison distribution function. The form of the sample comparison
distribution function suggested strongly that a nonparametric density estimator
be used to estimate the comparison density function. Results were to be derived
based on the weak convergence of a centered and scaled version of the sample
comparison distribution function. This centered and scaled process was called
the empirical comparison distribution process.

The relevant nonparametric density estimation techniques were reviewed and
it was decided to use the boundary kernel modification method of Gasser and
Müller (1979). Several pointwise results were proved for the boundary kernel
estimator of the comparison density function. Assuming the bandwidth shrinks
to zero at an appropriate rate and several conditions on the kernel hold, it was
proved that the boundary kernel estimate is asymptotically normal under H0.
Assuming a shrinking bandwidth, mild conditions on the kernel, and that the
proportion of the total sample represented by the first sample doesn’t change,
the pointwise weak consistency of the boundary kernel estimate was proved. The
boundary kernel estimate was also seen to be asymptotically invariant as to the
choice of which sample is called the first.

A stochastic process called the kernel density process was introduced. It is
a centered and scaled version of the boundary kernel estimate of the comparison
density function. The weak convergence of the stochastic process was proved
assuming mild conditions on the underlying kernel and a fixed bandwidth. Several
rationales for a fixed bandwidth were given. Among these were: (1) under H0
the boundary kernel estimate is asymptotically unbiased for fixed bandwidths
and (2) the fit of the small sample distribution to the asymptotic distribution
is better if the latter is derived under a fixed bandwidth. The kernel density
process forms the basis of testing the null hypothesis.

The results for the boundary kernel estimator of the comparison density
function are significant in their own right. No comparable results exist in the
literature. Processes such as the kernel density process are quite novel. This work
marks a different mode of thinking about kernel density estimates in general.

A statistic, $\varphi_h$, was defined. It is a scaled version of the squared $L_2$ distance
between the boundary kernel estimate of the comparison density function and the
uniform density function. While investigating its limiting distribution, the idea
of components of the kernel density process arose. The limiting distribution of
$\varphi_h$ is an infinite weighted sum of squares of these components. The components were
defined in detail. They were seen to be both generalized Fourier coefficients of an
orthonormal expansion of the kernel density process and linear rank statistics.
The orthonormal basis used is the set of eigenfunctions of the covariance kernel
of the kernel density process under H0. A numerical method for finding these
eigenfunctions was presented. The space these functions span was seen to be of
interest. This issue and its ramifications were investigated but not resolved.

The components were seen to be of more interest than the statistic which
motivated them. It was proposed to base a test of the null hypothesis on the first
M components and to give each equal weight in the test. This is in contrast to
standard portmanteau statistics and $\varphi_h$, which employ all the components but
successively downweight them.
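The effect of downweighting can be illustrated with the Cramér-von Mises statistic, whose limiting null distribution is the weighted sum of independent squared standard normal variables with weights 1/(j²π²). The sketch below simulates that sum. It is an illustration only and is not the method used in this work, which inverts characteristic functions numerically. The truncation at 50 terms, the number of replications, and the seed are arbitrary choices.

```python
import math
import random


def simulate_weighted_sum(weights, replications, rng):
    """Draw sorted realizations of sum_j w_j * Z_j^2 with Z_j independent N(0,1)."""
    draws = []
    for _ in range(replications):
        draws.append(sum(w * rng.gauss(0.0, 1.0) ** 2 for w in weights))
    return sorted(draws)


rng = random.Random(1989)

# Cramér-von Mises weights, truncated at 50 terms (an approximation).
cvm_weights = [1.0 / (j * math.pi) ** 2 for j in range(1, 51)]

draws = simulate_weighted_sum(cvm_weights, 20000, rng)
critical_5_percent = draws[int(0.95 * len(draws))]

# Share of the total weight carried by the first component alone.
share_of_first = cvm_weights[0] / sum(cvm_weights)
print(round(critical_5_percent, 2), round(share_of_first, 2))
```

The simulated 5% critical value should fall near the 0.460 listed in Tables 18 and 22, and the first component alone carries roughly 60% of the total weight, which is why such a statistic responds weakly to alternatives concentrated in higher order components.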

Under  H0  the  components  were  proved  to  be  asymptotically  iid  N(0,1).  A 
method  to  test  the  components  was  needed.  There  are  no  optimal  tests  such  as 
UMPU  tests  in  this  context.  The  two  commonly  used  tests  are  the  chi-square 
and  the  independent  tests  method.  A  new  test  was  proposed  instead  of  using  one 
of  these.  The  new  test  is  called  the  subset  chi-square  test.  It  rejects  H0  if  and 
only  if  the  sum  of  squares  of  some  subset  of  the  M  components  is  found  to  be  too 
large.  Unlike  the  chi-square  test,  this  test  indicates  which  components  are  found 
to  be  significant.  Unlike  the  independent  tests  method,  the  subset  chi-square 
explicitly  considers  the  components  together  and  not  just  singly.  As  measured 
by  power,  the  subset  chi-square  test  was  seen  to  be  a  good  compromise  between 
these other two.

The subset chi-square lends itself well to graphical display. Critical values
for the test were found by simulation. By indicating components which are
significant, the subset chi-square test suggested a subset orthogonal series estimator
of the comparison density. The relation of the orthogonal series estimator to the
boundary kernel estimator was investigated. The boundary kernel estimator was,
itself, shown to be a damped orthogonal series estimator. The orthogonal series
estimator suggested by the subset chi-square test simply includes or excludes
particular frequencies.

The power of the subset chi-square, $\varphi_h$, Cramér-von Mises, and
Anderson-Darling statistics was compared in Section 4. These last two are
commonly used portmanteau statistics. Powers for these tests were found for
local location and scale alternatives for the normal and Cauchy distributions.
Power functions were also calculated for what were termed Fourier alternatives.
The two Fourier alternatives used stressed the third through sixth components.
Location and scale alternatives significantly affected only the first through third
components.

The weak convergence of the empirical comparison distribution process under
local alternatives was proved. The power of the subset chi-square test applied
to the components of the kernel density process was found by simulation. For
the $\varphi_h$, Cramér-von Mises, and Anderson-Darling statistics, power functions
were found by numerically inverting an approximation to their characteristic
functions. A new method for this inversion based on the FFT was introduced.
A theorem concerning the numerical consistency of the method was proved.

The subset chi-square was seen to perform very well. The traditional statistics
performed better for location alternatives which affect mainly the first
component. Their advantage disappeared for scale alternatives. For the Fourier
alternatives, the standard statistics were found to be greatly lacking power in
comparison to the subset chi-square test. These statistics may be consistent
tests and the subset chi-square not, yet this seems little solace given their dismal
performance as measured by power.

Lest one believe that alternatives seen in practice are only location or scale,
an example was given in Section 5. The data were observed weekly returns to
two savings and loans over a 103 week period. All the standard portmanteau
tests failed to reject H0. The subset chi-square rejected H0 at the 5% level. The
fourth component was dominant. Further, the estimate of the comparison density
pictured the relation of the densities of the two populations in an understandable
manner. A second example was also analyzed. This data was simulated to exhibit
the Behrens-Fisher problem: the two samples were normal but with different
means and variances. The differences were sufficiently large for the standard tests
to detect them. Yet even in this case the procedure based on the comparison
density gave unity and insight into the relation of the two samples.

In summary, the unified techniques based on the boundary kernel estimate of
the comparison density achieve what was set out in Sections 1 and 2. The procedure
is unified and self-contained. The test has good power against a wide range
of alternatives. In fact, the breadth of this class is selected by the researcher.
The procedure has many useful and informative graphical elements. The class of
distributions to which it applies is very broad. When the test rejects H0, it is also
simultaneously selecting an orthogonal series estimate of the comparison density
function. The orthogonal series estimate is intimately related to the boundary
kernel estimate. The technique has been given a rigorous theoretical foundation.
Its use will give the researcher an opportunity to more thoroughly understand
his data and the information it contains.

6.2.  Areas  for  Future  Research 

It  is  only  natural  to  inquire  where  a  piece  of  research  will  lead.  Are  there 
opportunities  for  expanding  its  scope?  Are  there  other  areas  to  which  it  applies? 
For  the  methods  considered  in  this  work,  the  answer  is  “yes”  to  both  of  these 
questions. 

This research has concerned itself with the two sample problem. One could
term it k = 2. It is only natural to ask about k ≥ 3. This is the so-called k-sample
problem and it should be an area rich for research. The basic approach would be
to consider the ranks of each sample in a pooling of all k samples. One then has
k - 1 independent comparison densities to estimate. The choice of bandwidth
across the samples and the method of testing the components would need to be
considered in depth. There are certainly substantive issues to be addressed.

The  alternative  to  increasing  k  above  2  is  to  decrease  it  to  1.  This  is  known 
as  the  one  sample  problem.  The  one  sample  problem  has  just  a  single  sample  and 
tests a hypothesis of the form H0: F = F0, where F0 is some specified distribution
function.  The  methods  of  this  dissertation  should  apply  almost  wholesale  to  this 
problem.  All  one  does  is  exchange  the  empirical  comparison  distribution  process 
for  the  uniform  empirical  process.  The  rest  of  the  analysis  should  apply  almost 
directly. 
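
As a rough illustration of the object that would replace the empirical comparison distribution process, the uniform empirical process can be evaluated on a grid as follows. The function name, the grid, and the choice of F0 as the U(0,1) distribution function are assumptions made for the example only.

```python
import numpy as np

def uniform_empirical_process(sample, cdf0, grid):
    """Evaluate sqrt(n) * (ecdf of F0(X_i) at u  -  u) for each u in grid."""
    u = np.sort(cdf0(np.asarray(sample, dtype=float)))
    n = u.size
    grid = np.asarray(grid, dtype=float)
    ecdf = np.searchsorted(u, grid, side="right") / n
    return np.sqrt(n) * (ecdf - grid)

# Assumed null distribution for the example: F0 is the U(0,1) cdf.
grid = np.array([0.25, 0.5, 0.75])
process = uniform_empirical_process(
    [0.1, 0.2, 0.6, 0.9], lambda x: np.clip(x, 0.0, 1.0), grid
)
```

Under the null hypothesis F = F0 the transformed observations are uniform, so the process fluctuates around zero.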

In  summary,  there  are  very  good  prospects  for  expanding  the  methods  of 
this  work  to  related  problems.  The  two  most  likely  candidates  for  investigation 
are  the  k  sample  problem  and  the  one  sample  problem. 
APPENDIX A
GLOSSARY OF NOTATION

Appendix A is a glossary of the notation used in this work. In general, if
a function is subscripted by (N), the function is not random. The dependence
on N is through $\lambda_{(N)}$, the fraction of the pooled sample represented by the
first sample. For example, the function $D_{(N)}(u)$ is not random. If a function
is subscripted by N, then that function is random. For example, the function
$D_N(u)$ is random. Following is a list of the notation used in this work along with
a brief explanation.

• The first sample is $X_1, \ldots, X_m$, which is iid with distribution function $F$, density
function $f$, and quantile function $Q^F$.

• The second sample is $Y_1, \ldots, Y_n$, which is iid with distribution function $G$,
density function $g$, and quantile function $Q^G$,

or

• The second sample is $Y_{n1}, \ldots, Y_{nn}$, which is iid with distribution function $G_{(n)}$,
density function $g_{(n)}$, and a corresponding quantile function, for local alternatives.

• The first sample has m observations.

• The second sample has n observations.

• The pooled sample has $N = n + m$ observations.

• The empirical distribution functions are:

$F_m$ for the first sample,

$G_n$ for the second sample,

$H_N$ for the pooled sample.

• The empirical quantile functions are:

$Q^F_m$ for the first sample,

$Q^G_n$ for the second sample,

$Q^H_N$ for the pooled sample.
• The ratio of the size of the first sample to the size of the pooled sample is

$\lambda_{(N)} = m/N$.

This fraction satisfies one of the two following conditions:

$\lambda_{(N)} \to \lambda_0$ as $m \wedge n \to \infty$, $0 < \lambda_0 < 1$,

$\lambda_{(N)} = \lambda_0$, $0 < \lambda_0 < 1$.

• The true or population distribution function of the pooled sample is

$H_{(N)}(x) = \lambda_{(N)} F(x) + (1 - \lambda_{(N)}) G(x)$,
or

$H_0(x) = \lambda_0 F(x) + (1 - \lambda_0) G(x)$,
or

$H_{(N)}(x) = \lambda_{(N)} F(x) + (1 - \lambda_{(N)}) G_{(n)}(x)$,

for fixed m and n, as $m \wedge n \to \infty$, and for local alternatives, respectively.

• The true or population quantile function of the pooled sample is

$Q_{(N)}(u) = H_{(N)}^{-1}(u)$,
or

$Q_0(u) = H_0^{-1}(u)$,

for fixed m and n and as $m \wedge n \to \infty$, respectively.

• The sample comparison distribution function is

$D_N(u) = F_m(H_N^{-1}(u))$.

• The population comparison distribution function is

$D_{(N)}(u) = F Q_{(N)}(u)$,

$D_0(u) = F Q_0(u)$,

$D_\lambda(u) = F Q_\lambda(u)$,

for fixed m and n, as $m \wedge n \to \infty$, and for $\lambda_{(N)}$ equal to arbitrary $\lambda$, respectively.
• The population comparison density function is

$d_{(N)}(u) = D'_{(N)}(u)$,

$d_0(u) = D'_0(u)$,

$d_\lambda(u) = D'_\lambda(u)$,

for fixed m and n, as $m \wedge n \to \infty$, and for $\lambda_{(N)}$ equal to arbitrary $\lambda$, respectively.

• The comparison distribution empirical process is

$CD_N(u) = \sqrt{N}\,[D_N(u) - D_{(N)}(u)]$.

• The null comparison distribution empirical process is

$CD_{0N}(u) = \sqrt{N}\,[D_N(u) - u]$,

which equals $CD_N$ under $H_0$.

• The limiting process of $CD_N(u)$ is

$L(u)$.

• The boundary kernel estimate of the comparison density is

$d_h(w) = \frac{1}{h}\int_0^1 K_s([w - u]/h)\,dD_N(u)$.

• The sample kernel density process is

$KDP_{N,h}(w) = \frac{1}{h}\int_0^1 K_s([w - u]/h)\,dCD_N(u)$.

• The null sample kernel density process is

$KDP_{0N,h}(w) = \frac{1}{h}\int_0^1 K_s([w - u]/h)\,dCD_{0N}(u)$

$= \sqrt{N}\,[d_h(w) - 1]$.

• The limiting process of the sample kernel density process is

$KDP_h(w) = \frac{1}{h^2}\int_0^1 K'_s([w - u]/h)\,L(u)\,du$.

•  A  normalized  estimate  of  Pearson’s  phi-squared  statistic  is 

- 1|W 


It  converges  in  distribution  to 
APPENDIX B

PROOFS  OF  THEOREMS  AND  LEMMAS 


Proof  of  Theorem  3.2.1 

Before  starting  on  the  proof  of  Theorem  3.2.1,  a  quote  from  Van  Zwet  (1983) 
seems appropriate. He discusses the proof of the original Chernoff-Savage theorem
and subsequent developments:

“It  (Chernoff  and  Savage’s  proof)  struck  terror  into  the  hearts  of  graduate 
students  at  the  time  because  of — what  was  then  considered — its  extreme  tech¬ 
nicality;  in  order  to  approximate  the  rank  statistic  by  a  sum  of  independent 
random  variables  no  fewer  than  six  remainder  terms  were  shown  to  tend  to  zero, 
each  for  its  own  particular  reason.  Unfortunately,  the  number  of  such  remainder 
terms  has  increased  monotonically  over  the  years  and  nowadays  authors  in  this 
area  appear  to  need  at  least  fifteen.” 

Start by writing $\sqrt{Nh}\,[d_h(w) - 1]$ as

$\sqrt{Nh}\left\{\frac{1}{h}\int K([w - H_N(t)]/h)\,dF_m(t) - 1\right\}$.

Let $t = Q^F(u)$ and perform a change of variable to arrive at the following equivalent
expression:

$\sqrt{Nh}\left\{\frac{1}{h}\int_0^1 K([w - H_N Q^F(u)]/h)\,dF_m Q^F(u) - 1\right\}$.

These statements hold true for each $w \in (0,1)$ for h sufficiently small. Define the
uniform empirical distribution function for the first sample as $\Gamma^F_m(u) = F_m Q^F(u)$
and $\Gamma^H_N(u) = H_N Q^F(u)$ for the pooled sample. Under $H_0$, this last process is
also a uniform empirical distribution function. Substituting these quantities in
the above integral, one arrives at

(B.1)  $\sqrt{Nh}\,[d_h(w) - 1] = \sqrt{Nh}\left\{\frac{1}{h}\int_0^1 K([w - \Gamma^H_N(s)]/h)\,d\Gamma^F_m(s) - 1\right\}$.
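
For readers who wish to compute the quantity being normalized here, the integral against the first-sample empirical distribution function reduces to an average over the pooled-sample positions of the first sample. The sketch below covers only an interior point w (no boundary modification). The biweight kernel and the function names are assumptions for illustration; the kernel used in this work may differ.

```python
import numpy as np

def biweight(t):
    """Biweight kernel on [-1, 1]; an assumed choice with continuous K'."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= 1.0, 15.0 / 16.0 * (1.0 - t**2) ** 2, 0.0)

def comparison_density_interior(x, y, w, h):
    """(1/(m h)) * sum_i K([w - H_N(X_i)]/h), valid for h <= w <= 1 - h."""
    x = np.asarray(x, dtype=float)
    pooled = np.sort(np.concatenate([x, np.asarray(y, dtype=float)]))
    # H_N(X_i): fraction of pooled observations that are <= X_i.
    positions = np.searchsorted(pooled, x, side="right") / pooled.size
    return float(biweight((w - positions) / h).sum() / (x.size * h))

estimate = comparison_density_interior([1.0, 3.0], [2.0, 4.0], w=0.5, h=0.5)
```

In this toy example the first sample sits at pooled positions 1/4 and 3/4, so each observation contributes K(0.5) to the sum.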


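The mean value theorem states that for each $s \in [0,1]$ there exists a point $t_N(s)$
between $\Gamma^H_N(s)$ and $s$ such that

$K([w - \Gamma^H_N(s)]/h) = K([w - s]/h) - \frac{1}{h}K'([w - t_N(s)]/h)\,[\Gamma^H_N(s) - s]$.

The result of the application of the mean value theorem can now be substituted
into (B.1). This yields

$\sqrt{Nh}\left\{\frac{1}{h}\int_0^1 \left[K([w - s]/h) - \frac{1}{h}K'([w - t_N(s)]/h)\,[\Gamma^H_N(s) - s]\right] d\Gamma^F_m(s) - 1\right\}$

$= A_{1N} - A_{2N} - B_{1N} - B_{2N}$,

where

$A_{1N} = \sqrt{Nh}\left\{\frac{1}{h}\int_0^1 K([w - s]/h)\,d\Gamma^F_m(s) - 1\right\}$,

$A_{2N} = \frac{1}{\sqrt{h^3}}\int_0^1 K'([w - s]/h)\,U^H_N(s)\,ds$,

$B_{1N} = \frac{1}{\sqrt{h^3}}\int_0^1 \left[K'([w - t_N(s)]/h) - K'([w - s]/h)\right] U^H_N(s)\,ds$,

$B_{2N} = \frac{1}{\sqrt{h^3}}\int_0^1 K'([w - t_N(s)]/h)\,U^H_N(s)\,d[\Gamma^F_m(s) - s]$,

and

$U^H_N(s) = \sqrt{N}\,[\Gamma^H_N(s) - s]$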

is the uniform empirical process for the pooled sample. The first order terms
are $A_{1N}$ and $A_{2N}$. They will be shown to converge in distribution. The second
order terms are $B_{1N}$ and $B_{2N}$. They will be shown to converge in probability to
zero. Start with the first order terms. One can rewrite $A_{1N}$ as

$A_{1N} = \sqrt{Nh}\left\{\frac{1}{mh}\sum_{i=1}^{m} K([w - U_i]/h) - 1\right\}$,

where $U_i = F(X_i)$ is uniformly distributed. One can rewrite $A_{2N}$ as

$A_{2N} = \frac{1}{\sqrt{h}}\int_0^1 K([w - s]/h)\,dU^H_N(s)$

$= \sqrt{Nh}\left\{\frac{1}{h}\int_0^1 K([w - s]/h)\,d\Gamma^H_N(s) - \frac{1}{h}\int_0^1 K([w - s]/h)\,ds\right\}$.

The last term is equal to 1 for h small enough, so one has

$A_{2N} = \sqrt{Nh}\left\{\frac{1}{Nh}\sum_{i=1}^{m} K([w - U_i]/h) + \frac{1}{Nh}\sum_{j=1}^{n} K([w - V_j]/h) - 1\right\}$,

where $V_j = F(Y_j)$. Now examine $A_{1N} - A_{2N}$.


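$A_{1N} - A_{2N}$

$= \sqrt{Nh}\left\{\frac{1}{mh}\sum_{i=1}^{m} K([w - U_i]/h) - \frac{1}{Nh}\sum_{i=1}^{m} K([w - U_i]/h) - \frac{1}{Nh}\sum_{j=1}^{n} K([w - V_j]/h) - 1 + 1\right\}$

$= (1 - \lambda_{(N)})\sqrt{Nh}\left\{\frac{1}{mh}\sum_{i=1}^{m} K([w - U_i]/h) - 1\right\} - (1 - \lambda_{(N)})\sqrt{Nh}\left\{\frac{1}{nh}\sum_{j=1}^{n} K([w - V_j]/h) - 1\right\}$.

Both terms in this last equality can be shown to converge to a limiting normal
distribution by invoking a CLT for triangular arrays. Since the two terms are
independent, the limiting distribution of their sum is that of the sum of two
independent limits. One reaches the conclusion that

$A_{1N} - A_{2N} \xrightarrow{d} Z$,

where Z has the normal distribution with mean 0 and variance

$\frac{1 - \lambda_0}{\lambda_0}\int_{-1}^{1} K^2(t)\,dt$,

since $h \to 0$ and $(m \wedge n)h \to \infty$ as $m \wedge n \to \infty$.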

Next it is shown that $B_{1N}$ converges to 0 in probability. Start by bounding
$B_{1N}$:

$|B_{1N}| \le \sup_{0 \le s \le 1}|U^H_N(s)|\,\frac{1}{\sqrt{h^3}}\int_0^1 \left|K'([w - t_N(s)]/h) - K'([w - s]/h)\right| ds$.

For $\Gamma^H_N(s)$ and $s$ both outside the range $(w - h, w + h)$, the integrand is identically
zero. Inside this range, the mean value theorem and the assumption that the
second derivative of K is bounded by M give

$\left|K'([w - t_N(s)]/h) - K'([w - s]/h)\right| \le \frac{M}{h}\,|t_N(s) - s|$.

This yields the following bound on $B_{1N}$:

$|B_{1N}| \le M \sup_{0 \le s \le 1}|U^H_N(s)|\,\frac{1}{\sqrt{h^3}}\int_0^1 \frac{1}{h}\,|t_N(s) - s|\,I_h(s, \Gamma^H_N(s))\,ds$

(B.2)  $\le M \left(\sup_{0 \le s \le 1}|U^H_N(s)|\right)^2 \frac{1}{\sqrt{Nh^3}}\int_0^1 \frac{1}{h}\,I_h(s, \Gamma^H_N(s))\,ds$,

since $|t_N(s) - s| \le |\Gamma^H_N(s) - s|$ and where $I_h(s, \Gamma^H_N(s))$ is 1 if either $w - h < s < w + h$
or $w - h < \Gamma^H_N(s) < w + h$, and 0 otherwise. An equivalent expression for $I_h(s, \Gamma^H_N(s))$ is

(B.3)  $I_h(s, \Gamma^H_N(s)) = I_h(s) + I_h(\Gamma^H_N(s)) - I_h(s)\,I_h(\Gamma^H_N(s))$,

where $I_h(r) = 1$ if $w - h < r < w + h$ and 0 otherwise.

The last integral in (B.2) must be evaluated. Substituting the expression
(B.3) into this integral results in

$\int_0^1 \frac{1}{h}\,I_h(s, \Gamma^H_N(s))\,ds = 2 + \int_0^1 \frac{1}{h}\,I_h(\Gamma^H_N(s))\,ds - \int_{w-h}^{w+h} \frac{1}{h}\,I_h(\Gamma^H_N(s))\,ds$

$= 2 + R_{1N} - R_{2N}$,

where $R_{1N}$ and $R_{2N}$ denote the two integrals in the preceding line.

The function $\Gamma^H_N(s)$ is the empirical distribution function of $F(X_1), \ldots, F(X_m)$
and $F(Y_1), \ldots, F(Y_n)$, which are distributed as N iid U(0,1) random variables
under $H_0$. One can bound $R_{1N}$ by $[Q_N(w + 1.5h) - Q_N(w - 1.5h)]/h$, where
$Q_N$ is the empirical quantile function of these N random variables. This last
bound looks like a derivative. Applying Bahadur's representation to the sample
quantiles, one can show that $[Q_N(w + 1.5h) - Q_N(w - 1.5h)]/h \xrightarrow{p} 3$ since
$(m \wedge n)h^2 \to \infty$. The term $R_{2N}$ is easily handled since $0 \le R_{2N} \le 2$.

Putting all the components together, one concludes that

$|B_{1N}| \le M \left(\sup_{0 \le s \le 1}|U^H_N(s)|\right)^2 \frac{1}{\sqrt{Nh^3}}\int_0^1 \frac{1}{h}\,I_h(s, \Gamma^H_N(s))\,ds$

$= O_p(1)/\sqrt{Nh^3} \xrightarrow{p} 0$

since $(m \wedge n)h^3 \to \infty$. The term $\sup_{0 \le s \le 1}|U^H_N(s)|$ is $O_p(1)$ since it converges
in distribution as $m \wedge n \to \infty$. In fact, this term is the Kolmogorov-Smirnov
statistic; see Shorack and Wellner (1986), page 91.
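
The supremum of the absolute uniform empirical process is attained at (or just before) an order statistic, so it can be computed exactly from the sorted sample. The following is a minimal sketch of that computation; the function name is hypothetical and the inputs are assumed to lie in [0, 1].

```python
import numpy as np

def sup_uniform_empirical_process(u):
    """sqrt(N) * sup_s |ecdf(s) - s| for values u assumed to lie in [0, 1]."""
    u = np.sort(np.asarray(u, dtype=float))
    n = u.size
    i = np.arange(1, n + 1)
    d_plus = np.max(i / n - u)          # ecdf just at each order statistic
    d_minus = np.max(u - (i - 1) / n)   # ecdf just before each order statistic
    return float(np.sqrt(n) * max(d_plus, d_minus))

ks = sup_uniform_empirical_process([0.25, 0.75])
```

For the two-point example the largest gap between the empirical and the uniform distribution function is 0.25, so the statistic equals 0.25 times the square root of 2.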

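It must now be shown that $B_{2N} \xrightarrow{p} 0$. The term $B_{2N}$ is bounded as
$|B_{2N}| \le |C_{1N}| + |C_{2N}|$, where

$C_{1N} = \frac{1}{\sqrt{h^3}}\int_0^1 \left[K'([w - t_N(s)]/h) - K'([w - s]/h)\right] U^H_N(s)\,d[\Gamma^F_m(s) - s]$

and

$C_{2N} = \frac{1}{\sqrt{h^3}}\int_0^1 K'([w - s]/h)\,U^H_N(s)\,d[\Gamma^F_m(s) - s]$.

Start by examining $C_{1N}$. It can be shown that $|C_{1N}|$ is bounded by

$|C_{1N}| \le \sup_{0 \le s \le 1}|U^H_N(s)|\,\frac{M}{\sqrt{h^3}}\int_0^1 \frac{1}{h}\,|t_N(s) - s|\,I_h(s, \Gamma^H_N(s))\,d\Gamma^F_m(s)$

$+ \sup_{0 \le s \le 1}|U^H_N(s)|\,\frac{M}{\sqrt{h^3}}\int_0^1 \frac{1}{h}\,|t_N(s) - s|\,I_h(s, \Gamma^H_N(s))\,ds$

(B.4)  $\le M \left(\sup_{0 \le s \le 1}|U^H_N(s)|\right)^2 \frac{1}{\sqrt{Nh^3}}\left\{\int_0^1 \frac{1}{h}\,I_h(s, \Gamma^H_N(s))\,d\Gamma^F_m(s) + \int_0^1 \frac{1}{h}\,I_h(s, \Gamma^H_N(s))\,ds\right\}$.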

From above it is known that the second term in the braces is $O_p(1)$. Examine
the first term in the braces:

$\int_0^1 \frac{1}{h}\,I_h(s, \Gamma^H_N(s))\,d\Gamma^F_m(s)$

(B.5)  $= \int_0^1 \frac{1}{h}\,I_h(s)\,d\Gamma^F_m(s) + \int_0^1 \frac{1}{h}\,I_h(\Gamma^H_N(s))\,d\Gamma^F_m(s) - \int_0^1 \frac{1}{h}\,I_h(s)\,I_h(\Gamma^H_N(s))\,d\Gamma^F_m(s)$.

The first term of (B.5) is equal to

$\frac{1}{mh}\sum_{i=1}^{m} I_h(U_i)$,

which is a kernel density estimate at the point w using the uniform kernel. It
converges in probability [cf. Parzen (1962a)] to 2. The second term of (B.5) is
handled by

$0 \le \int_0^1 \frac{1}{h}\,I_h(\Gamma^H_N(s))\,d\Gamma^F_m(s) = \frac{1}{mh}\sum_{i=1}^{m} I_h(\Gamma^H_N(U_i))$

$\le \frac{1}{mh}\sum_{i=1}^{N} I_h(\Gamma^H_N(Z_i)) = \frac{1}{mh}\sum_{i=1}^{N} I_h(i/N)$,

where $Z_i = U_i$ for $i = 1, \ldots, m$ and $Z_{i+m} = V_i$ for $i = 1, \ldots, n$. Hence the second
term is $O_p(1)$. Similarly, for the third term of (B.5) one has

$0 \le \int_0^1 \frac{1}{h}\,I_h(s)\,I_h(\Gamma^H_N(s))\,d\Gamma^F_m(s) \le \int_0^1 \frac{1}{h}\,I_h(s)\,d\Gamma^F_m(s) = O_p(1)$.

Returning to equation (B.4), the term in braces has been shown to be $O_p(1)$.
Since $(m \wedge n)h^3 \to \infty$, $C_{1N} \xrightarrow{p} 0$ as $m \wedge n \to \infty$.

Now examine $C_{2N}$. This is the more difficult of the two terms. After some
rearranging, $C_{2N}$ can be written as


cw  =  -  2.1/M  - 1 

t=l 

1  171 

-  -rj  £  *'([«'  -  MI/M|r®(M)  -  Ml]  • 

"  1=1 

By  repeated  application  of  the  mean  value  theorem,  one  finds  that 

N  -  N 

DiN  =  £*(!»  ~  2.1/'*)  -  -  (*VJV)]/fc) 


1  =  1 
N 


»=1 


-  ^  £  *'((“  -  Mvl/M|r"(2.)  -  Zi  1 

i=l 


=  0, 
where $t_{iN}$ is between $Z_i$ and $\Gamma^H_N(Z_i)$, and since $\Gamma^H_N(Z_{(i)}) = i/N$. If it can be
shown that


d2N  =  E  K^w  -  (*/^)]/M  -  i]  -  0, 

*=1 

N 

D3N  =  -  rii¥|/*0[r#(^)  -  Zi\ 

»=1 

1  N  , 

*'([>" - Zil/*)[r %(Zi)  - 2,1]  0, 

t=l 

JV 

D,iv  =  E  *■'(!”  -  2.1/Mir^Zi)  -  2.1 

t=l 

1  m  l 

-  ^  E  “  £;.i/*)Ir"(c'.)  -  Oil]  o, 

»=1 

then  C2iv  0  since  C2#  =  —  I?ijv  —  —  Start  with  D2n-  Let 

$g(t) = \int_0^t \frac{1}{h}K([w - s]/h)\,ds$.

By Taylor's theorem $g(t) = g(r) + (t - r)g'(r) + 0.5(t - r)^2 g''(c)$, where c is between
t and r. Let $t = (i - 1)/N$ and $r = i/N$. Then

$g([i - 1]/N) = g(i/N) - \frac{1}{Nh}K([w - (i/N)]/h) + O(1/N^2h^2)$,

if $i/N > w - h$ and $(i - 1)/N < w + h$; otherwise $g(i/N) - g([i - 1]/N) = 0$. So

$\sum_{i=1}^{N}\left\{g(i/N) - g([i - 1]/N) - \frac{1}{Nh}K([w - (i/N)]/h)\right\}$

$= \sum_{i=1}^{N} O(1/N^2h^2)\,I(i/N > w - h)\,I([i - 1]/N < w + h)$.

There are about $2Nh$ terms in this last sum, so $2Nh \cdot O(1/N^2h^2) = O(1/Nh)$ and

$\sum_{i=1}^{N}\left\{g(i/N) - g([i - 1]/N) - \frac{1}{Nh}K([w - (i/N)]/h)\right\}$

$= g(1) - g(0) - \frac{1}{Nh}\sum_{i=1}^{N} K([w - (i/N)]/h)$

$= 1 - \frac{1}{Nh}\sum_{i=1}^{N} K([w - (i/N)]/h) = O(1/Nh)$.

This implies that $D_{2N} = O(1/\sqrt{Nh})$ and hence $D_{2N} \to 0$ since $(m \wedge n)h \to \infty$.
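
The Riemann-sum step above is easy to check numerically. The sketch below evaluates the gap between 1 and the sum for an interior point. The biweight kernel, the function names, and the particular values of w, h, and N are assumptions chosen only for illustration.

```python
import numpy as np

def biweight(t):
    """Biweight kernel on [-1, 1]; an assumed smooth kernel integrating to 1."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= 1.0, 15.0 / 16.0 * (1.0 - t**2) ** 2, 0.0)

def riemann_gap(w, h, n_total):
    """1 - (1/(N h)) * sum_{i=1}^{N} K([w - i/N]/h)."""
    i = np.arange(1, n_total + 1)
    return float(1.0 - biweight((w - i / n_total) / h).sum() / (n_total * h))

gap = riemann_gap(w=0.5, h=0.1, n_total=1000)
```

For a smooth kernel the gap is typically far smaller than the stated order 1/(Nh), which is all the argument requires.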
Next examine $D_{3N}$. This term can be bounded by

I-Dsjvl  <  sup  \T»(s)  -  s| 


0<a<l 
N 


■  ^  £  !*'([»  -  <•<*]/*)  -  *'(!«>  -  *]/*)! 
t=l 

<  8UP 


■  {hW  +  /k(rff(Z<))  -  4(Z04(rjif(z,))}. 

The last sum results from the mean value theorem and the fact that
$K'([w - t_{iN}]/h) = 0$ and $K'([w - Z_i]/h) = 0$ if both $Z_i$ and $\Gamma^H_N(Z_i)$ are outside the
interval $(w - h, w + h)$. The first term in the braces is the kernel density estimate
of the uniform density (under $H_0$) based on the $Z_i$'s. This quantity converges in
probability to 2 as $(m \wedge n)h \to \infty$. The second term is actually nonrandom and
is just

$\frac{1}{Nh}\sum_{i=1}^{N} I_h(i/N)$,

which converges to 2 as $(m \wedge n)h \to \infty$. The last term in the braces is bounded
by both the first and second terms. Thus

$|D_{3N}| \le \left(\sup_{0 \le s \le 1}|U^H_N(s)|\right)^2 \frac{M}{\sqrt{Nh^3}}\,O_p(1)$

and so one may conclude that $D_{3N} \xrightarrow{p} 0$ since $(m \wedge n)h^3 \to \infty$.

The term $D_{4N}$ is the most tedious to deal with. It can be rewritten as

Dm  =  (1  -  A,*,)  [^J  £  K'(\w  -  u,\/k)uH(u,) 

»= 1 
The claim is that $D_{4N} \xrightarrow{p} 0$. One needs to show that $E[D_{4N}^2] \to 0$. Begin by
evaluating the expectation

$E\left[\frac{1}{h^3}K'([w - Z_i]/h)\,K'([w - Z_j]/h)\,U^H_N(Z_i)\,U^H_N(Z_j)\right]$.

Start for $i \ne j$ and first find the expectation

$E\left[U^H_N(Z_1)\,U^H_N(Z_2) \mid Z_1 = s, Z_2 = t\right]$.

After several pages of algebra, this expectation is found to be

$\min(s,t) - st + O(1/N)$.
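
The leading term here is the familiar covariance of the uniform empirical process, and it can be checked by simulation. The sketch below estimates the unconditional covariance at two fixed points s and t; it does not condition on two sample points as the text does, so the O(1/N) correction is not reproduced. The function name, sample sizes, and seed are assumptions for the example.

```python
import numpy as np

def empirical_process_cov(s, t, n_obs, n_rep, seed=0):
    """Monte Carlo estimate of E[U_N(s) U_N(t)] for iid U(0,1) samples."""
    rng = np.random.default_rng(seed)
    u = rng.random((n_rep, n_obs))
    # Uniform empirical process at s and at t, one value per replication.
    proc_s = np.sqrt(n_obs) * ((u <= s).mean(axis=1) - s)
    proc_t = np.sqrt(n_obs) * ((u <= t).mean(axis=1) - t)
    return float(np.mean(proc_s * proc_t))

estimate = empirical_process_cov(s=0.3, t=0.6, n_obs=200, n_rep=4000)
target = min(0.3, 0.6) - 0.3 * 0.6
```

With these settings the Monte Carlo standard error is roughly 0.004, so the estimate should sit close to the target value min(s,t) - st = 0.12.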

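One can then find the desired expectation from the conditional expectation:

$E\left[\frac{1}{h^3}K'([w - Z_i]/h)\,K'([w - Z_j]/h)\,U^H_N(Z_i)\,U^H_N(Z_j)\right]$

$= E_{Z_i,Z_j}\left[\frac{1}{h^3}K'([w - Z_i]/h)\,K'([w - Z_j]/h)\,E\left[U^H_N(Z_i)\,U^H_N(Z_j) \mid Z_i, Z_j\right]\right]$

$= \frac{1}{h}\int\!\!\int K'(u)\,K'(v)\left[\min(w - hu, w - hv) - (w - hu)(w - hv)\right] du\,dv + O(1/Nh)$

$= C(h)$,

for $i \ne j$. The procedure is similar for $i = j$ and one arrives at

$E\left[\frac{1}{h^3}K'([w - Z_1]/h)^2\,U^H_N(Z_1)^2\right]$

$= \frac{1}{h^2}\int_{-1}^{1} K'(y)^2\,(w - hy)(1 - w + hy)\,dy + O(1/Nh^2)$

$= V(h)$.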

Putting  these  results  together,  one  finds 

E1Djwl .  m + m  _  £w  _  £w . 


m  n  m  n 


It is obvious that

$V(h)/n \to 0$ if $nh^2 \to \infty$,

$V(h)/m \to 0$ if $mh^2 \to \infty$,

$C(h)/n \to 0$ if $nh \to \infty$,

$C(h)/m \to 0$ if $mh \to \infty$,

and hence that $E[D_{4N}^2] \to 0$, which implies that $D_{4N} \xrightarrow{p} 0$. Backing up through
the remainder terms, it has been shown now that $C_{2N} \xrightarrow{p} 0$ so that $B_{2N} \xrightarrow{p} 0$.
An application of Slutsky's theorem to $A_{1N}$, $A_{2N}$, $B_{1N}$, and $B_{2N}$ yields the
desired result.

Proof  of  Theorem  3.2.2 

Theorem 3.2.2 is concerned with the consistency of the boundary kernel
estimator of the comparison density function. Start by writing

$|d_h(w) - d_0(w)| \le A_{1N} + A_{2N} + A_{3N}$,

where

$A_{1N} = \frac{1}{\sqrt{N}}\left|\frac{1}{h}\int_0^1 K([w - u]/h)\,dCD_N(u)\right|$,

$A_{2N} = \left|\frac{1}{h}\int_0^1 K([w - u]/h)\,d_{(N)}(u)\,du - d_{(N)}(w)\right|$,

$A_{3N} = |d_{(N)}(w) - d_0(w)|$.

As in Theorem 3.2.1, since $h \to 0$ it is not necessary to worry about boundary
modifications. Each of these terms must be shown to converge to zero in
probability. Start with

$A_{1N} \le \frac{1}{\sqrt{N}h^2}\int_0^1 |CD_N(u)|\,|K'([w - u]/h)|\,du$

$\le \sup_{0 \le u \le 1}|CD_N(u)|\,\frac{1}{\sqrt{N}h}\int_{-1}^{1} |K'(y)|\,dy$.

One can show $\sup_{0 \le u \le 1}|CD_N(u)|$ is $O_p(1)$ in the following manner. It is known
(see Section 2) that $CD_N$ converges to a limiting process under fixed alternatives
as well as under $H_0$. By Theorem 3.11 of Ruymgaart (1988) it follows that
this term converges in distribution to a limiting random variable and hence it is
$O_p(1)$. It follows that $A_{1N} \xrightarrow{p} 0$ since $(m \wedge n)h^2 \to \infty$ by assumption.

The proof that $A_{2N} \to 0$ follows that of Theorem 1A of Parzen (1962a). Let

$g_{(N)}(w) = \frac{1}{h}\int_0^1 K([w - u]/h)\,d_{(N)}(u)\,du$

$= \int_{-\infty}^{\infty} \frac{1}{h}K(y/h)\,d_{(N)}(w - y)\,dy$

for h sufficiently small. Pick $\delta > 0$ and write

$|g_{(N)}(w) - d_{(N)}(w)|$

$\le \left|\int_{|y| \le \delta} [d_{(N)}(w - y) - d_{(N)}(w)]\,\frac{1}{h}K(y/h)\,dy\right| + \left|\int_{|y| > \delta} [d_{(N)}(w - y) - d_{(N)}(w)]\,\frac{1}{h}K(y/h)\,dy\right|$

$\le \max_{|y| \le \delta}|d_{(N)}(w - y) - d_{(N)}(w)| \int_{|z| \le \delta/h} |K(z)|\,dz$

$+ \int_{|y| > \delta} \frac{d_{(N)}(w - y)}{|y|}\,\frac{|y|}{h}\,|K(y/h)|\,dy + d_{(N)}(w)\int_{|y| > \delta} \frac{1}{h}\,|K(y/h)|\,dy$

$\le \max_{|y| \le \delta}|d_{(N)}(w - y) - d_{(N)}(w)| \int_{-\infty}^{\infty} |K(z)|\,dz$

$+ \frac{1}{\delta}\sup_{|z| \ge \delta/h}|zK(z)| \int_{-\infty}^{\infty} d_{(N)}(y)\,dy + d_{(N)}(w)\int_{|z| \ge \delta/h} |K(z)|\,dz$.

The strategy is to let $m \wedge n \to \infty$ for fixed $\delta$ and then let $\delta \to 0$. The last two
terms tend to zero as $m \wedge n \to \infty$ and $h \to 0$. This leaves only the first term.
Rewrite the first term as follows:

$\max_{|y| \le \delta}|d_{(N)}(w - y) - d_{(N)}(w)| \le \max_{|y| \le \delta}|d_{(N)}(w - y) - d_0(w - y)|$

$+ \max_{|y| \le \delta}|d_0(w) - d_{(N)}(w)| + \max_{|y| \le \delta}|d_0(w - y) - d_0(w)|$.

The second term tends to zero as $m \wedge n \to \infty$ because $d_{(N)}$ is converging to $d_0$.
The third term tends to zero as $\delta \to 0$ because $d_0$ is continuous. This leaves
only the first term. It will tend to zero if $d_{(N)}$ converges to $d_0$ uniformly. If
$\lambda_{(N)} = \lambda_0$ then $d_{(N)} = d_0$ and Theorem 1A of Parzen (1962a) can be applied
directly to show that $A_{2N} \to 0$. Since $d_{(N)}$ converges to $d_0$, $A_{3N} \to 0$.

Conclude that under these conditions, one has $d_h(w) \xrightarrow{p} d_0(w)$.


Proof  of  Lemma  3.2.1 


Lemma 3.2.1 states that the Gasser-Muller boundary kernel satisfies the
Regularity Conditions. It must be shown that for all $\epsilon > 0$ there exists a $\delta = \delta(\epsilon)$
such that

$\int_0^1 \left|K'_s([w - u]/h) - K'_{s'}([v - u]/h)\right| du < \epsilon$

if $|w - v| < \delta$. Define

$P = \{u : \lambda(w, [w - u]/h) = \lambda(v, [v - u]/h) = 1\}$,

$Q = \{u : \lambda(w, [w - u]/h) \ne \lambda(v, [v - u]/h)\}$,

where

$\lambda(s,t) = I_{[-1,\,1]}(t)$ if $h \le s \le 1 - h$,
$\lambda(s,t) = I_{[(s-1)/h,\,1]}(t)$ if $1 - h < s \le 1$,
$\lambda(s,t) = I_{[-1,\,s/h]}(t)$ if $0 \le s < h$,

and I is an indicator function. The function $\lambda(s,t)$ indicates the region where
the boundary kernel is non-zero. One can now write

$\int_0^1 \left|K'_s([w - u]/h) - K'_{s'}([v - u]/h)\right| du$

$= \int_P \left|K'_s([w - u]/h) - K'_{s'}([v - u]/h)\right| du + \int_Q \left|K'_s([w - u]/h) - K'_{s'}([v - u]/h)\right| du$

(B.6)  $\le \sup_{u \in P}\left|K'_s([w - u]/h) - K'_{s'}([v - u]/h)\right| + \int_Q \left|K'_s([w - u]/h) - K'_{s'}([v - u]/h)\right| du$,

where $s = s(w,h)$ and $s' = s(v,h)$ index the boundary kernel (see Section 2).
Concentrate on the first term of (B.6). For $u \in P$, one has $\lambda(w, [w - u]/h) = 1$
and $\lambda(v, [v - u]/h) = 1$, so that

$\sup_{u \in P}\left|K'_s([w - u]/h) - K'_{s'}([v - u]/h)\right|$

$= \sup_{u \in P}\left|(\theta_s + \phi_s (w - u)/h)\,K'([w - u]/h) + \phi_s K([w - u]/h) - (\theta_{s'} + \phi_{s'} (v - u)/h)\,K'([v - u]/h) - \phi_{s'} K([v - u]/h)\right|$

$\le (M_1 + M_2 + M_3 + 2M_4 + M_5)\,\epsilon^*$,

if

$|w - v| < h\,\delta(\epsilon^*) = h \min(\delta_1(\epsilon^*), \delta_2(\epsilon^*), \delta_3(\epsilon^*), \delta_4(\epsilon^*), \delta_5(\epsilon^*))$,

where

$M_1 = \sup_u |K'(u)|$,

$M_2 = \sup_{0 \le s \le 1} |\theta_s|$,

$M_3 = \sup_s |sK'(s)|$,

$M_4 = \sup_{0 \le s \le 1} |\phi_s|$,

$M_5 = \sup_s |K(s)|$.

The $\delta_i(\epsilon^*)$'s result from the uniform continuity (u.c.) of the following functions:

$\delta_1(\epsilon^*)$ : $\theta_s$ is u.c. in s
$\delta_2(\epsilon^*)$ : $K'(t)$ is u.c.
$\delta_3(\epsilon^*)$ : $\phi_s$ is u.c. in s
$\delta_4(\epsilon^*)$ : $tK'(t)$ is u.c.
$\delta_5(\epsilon^*)$ : $K$ is u.c.

Thus one can choose $\epsilon^*$ so that

(B.7)  $\sup_{u \in P}\left|K'_s([w - u]/h) - K'_{s'}([v - u]/h)\right| < \epsilon/2$

if $|w - v| < \delta(\epsilon)$.

Moving to the second term of equation (B.6), the claim is that Q is either (1)
empty, (2) an interval of length less than $|w - v|$, or (3) the union of two intervals
the sum of whose lengths is less than $2|w - v|$.

Define $|Q|$ to be the Lebesgue measure of Q. The above claim is proved by
enumerating all possible combinations of cases of the boundary kernel: left boundary,
interior, and right boundary.

1. If $w = v$ then Q is empty.

2. If $0 \le w < h$ then $K'_s([w - u]/h)$ is nonzero on $0 \le u \le w + h$; if $0 \le v < h$
then $K'_{s'}([v - u]/h)$ is nonzero on $0 \le u \le v + h$. Thus Q has measure $|w - v|$.

3. If $1 - h < w \le 1$ then $K'_s([w - u]/h)$ is nonzero on $w - h \le u \le 1$; if
$1 - h < v \le 1$ then $K'_{s'}([v - u]/h)$ is nonzero on $v - h \le u \le 1$. Then
$|Q| = |w - v|$.

4. If $0 \le w < h$ then $K'_s([w - u]/h)$ is nonzero on $0 \le u \le w + h$; if $h \le v \le 1 - h$
then $K'_{s'}([v - u]/h)$ is nonzero on $v - h \le u \le v + h$. If the two supports
don't overlap, then $|Q| = 3h + w < 2|w - v|$. If the two supports do overlap,
then

$|Q| = |w - v| + |v - h| < 2|w - v|$.

5. If $h \le w \le 1 - h$ then $K'_s([w - u]/h)$ is nonzero on $w - h \le u \le w + h$; if
$h \le v \le 1 - h$ then $K'_{s'}([v - u]/h)$ is nonzero on $v - h \le u \le v + h$. If the
two supports don't overlap, then $|Q| = 4h < 2|w - v|$. If the two supports
do overlap, then $|Q| = 2|w - v|$.

6. If $1 - h < w \le 1$ then $K'_s([w - u]/h)$ is nonzero on $w - h \le u \le 1$; if
$h \le v \le 1 - h$ then $K'_{s'}([v - u]/h)$ is nonzero on $v - h \le u \le v + h$. If the
two supports don't overlap, then $|Q| = 3h + 1 - w < 2|w - v|$. If the two
supports do overlap, then

$|Q| = |w - v| + |1 - h - v| < 2|w - v|$.

7. If $0 \le w < h$ then $K'_s([w - u]/h)$ is nonzero on $0 \le u \le w + h$; if $1 - h < v \le 1$
then $K'_{s'}([v - u]/h)$ is nonzero on $v - h \le u \le 1$. If the two supports don't
overlap, then $|Q| = 1 + 2h + w - v < 4h < 2|w - v|$. If the two supports do
overlap, then

$|Q| = |v - h| + |1 - h - w| \le 2|w - v|$.

Hence $|Q| \le 2|w - v|$. This implies that

(B.8)  $\int_Q \left|K'_s([w - u]/h) - K'_{s'}([v - u]/h)\right| du \le 4M|w - v|$,

where $M = \sup_{s,t} |K'_s(t)| < \infty$ because $K'$ is continuous. The right side of (B.8)
is at most $\epsilon/2$ when $|w - v| \le \epsilon/8M$. Finally, choose

$\delta^*(\epsilon) = \min(\delta(\epsilon), \epsilon/8M)$.

Combining (B.7) and (B.8) with this choice of $\delta^*(\epsilon)$ one sees that

$\int_0^1 \left|K'_s([w - u]/h) - K'_{s'}([v - u]/h)\right| du < \epsilon$

if $|w - v| < \delta^*(\epsilon)$. Hence, the result is uniform continuity and the Regularity
Condition holds.


Proof  of  Lemma  3.2.2 

It is to be proved that the sample paths of $KDP_h$ exist and are continuous with
probability 1. Start the proof by showing that L(u) is continuous with probability
1. Pyke and Shorack (1968) give the following representation for L(u):

$L(u) = (1 - \lambda_0)\left\{\lambda_0^{-1/2}\,d^G_0(u)\,U_0[D^F_0(u)] - (1 - \lambda_0)^{-1/2}\,d^F_0(u)\,V_0[D^G_0(u)]\right\}$,

where $U_0$ and $V_0$ are independent Brownian bridges and

$D^F_0(u) = F Q^H_0(u)$,

$D^G_0(u) = G Q^H_0(u)$,

$d^F_0(u) = \frac{d}{du}D^F_0(u)$,

$d^G_0(u) = \frac{d}{du}D^G_0(u)$,

$Q^H_0(u) = H_0^{-1}(u)$,

$H_0(x) = \lambda_0 F(x) + (1 - \lambda_0) G(x)$.

It is known [see Billingsley (1968), page 61] that the sample paths of a Brownian
bridge are continuous with probability 1. As a result of the assumptions made
about the two distributions, the functions $D^F_0$, $D^G_0$, $d^F_0$, and $d^G_0$ are continuous.
Since the compositions, products and differences of continuous functions
are also continuous, conclude that the sample paths of L(u) are continuous with
probability 1.

Since L(u) is continuous with probability 1, the integral defining $KDP_h$ exists
with probability 1. Define $r(w)$ by

$r(w) = \frac{1}{h^2}\int_0^1 K'_s([w - u]/h)\,c(u)\,du$,

where c(u) is any continuous function. It must be shown that $r(w)$ is continuous.
This is done by bounding $|r(w) - r(v)|$ as follows:

$|r(w) - r(v)| \le \frac{1}{h^2}\sup_{0 \le u \le 1}|c(u)| \int_0^1 \left|K'_s([w - u]/h) - K'_{s'}([v - u]/h)\right| du$.

Since c(u) is continuous, $\sup_{0 \le u \le 1}|c(u)| < \infty$. It was shown in Lemma 3.2.1
that the integral can be made smaller than any $\epsilon > 0$ if $|w - v| < \delta(\epsilon)$. Hence
$r(w)$ is continuous for any continuous function c(u). Thus, it may be concluded
that $KDP_h$ is continuous with probability 1.

Proof  of  Theorem  3.2.3 

The proof of Theorem 3.2.3 is patterned after the proof of Theorem 3.9 of
Ruymgaart (1988). This latter theorem concerns the weak convergence of the uniform
empirical process to a Brownian bridge process. Before proving Theorem 3.2.3,
a lemma is stated and proved.

Lemma B.1. Let

$g(w) = \frac{1}{h^2}\int_0^1 K'_s([w - t]/h)\,U(t)\,dt$,

where $U \in D[0,1]$; $K'_s(t)$ is the first derivative of a boundary kernel satisfying the
regularity conditions and $s = s(w,h)$. Then $g(w)$ is uniformly continuous and

$\sup_{|w - v| < \delta}|g(w) - g(v)| \le \frac{1}{h^2}\sup_{0 \le t \le 1}|U(t)|\,\Theta(\delta)$,

where $\Theta(\delta)$ is defined in the statement of the Regularity Conditions in Subsection
3.2.2.

Proof.

The term $|g(w) - g(v)|$ can be bounded by

$\frac{1}{h^2}\sup_{0 \le t \le 1}|U(t)| \int_0^1 \left|K'_s([w - t]/h) - K'_{s'}([v - t]/h)\right| dt$.

Taking the supremum of the above over $|w - v| < \delta$ yields

$\sup_{|w - v| < \delta}|g(w) - g(v)| \le \frac{1}{h^2}\sup_{0 \le t \le 1}|U(t)|\,\Theta(\delta)$.

From this it follows that $g(w)$ is uniformly continuous since the bound on the
right does not depend on w or v. The bound on the right is finite since $U \in D[0,1]$
implies that $\sup_{0 \le t \le 1}|U(t)| < \infty$ [see Gaenssler (1983), page 90, for a statement
of this].
The strategy for proving the weak convergence of $KDP_{N,h}$ to $KDP_h$ is to show
that

$E[g(KDP_{N,h})] \to E[g(KDP_h)]$ as $m \wedge n \to \infty$,

where g is any bounded and $\rho$-uniformly continuous functional, $g : C[0,1] \to \mathbb{R}$.
The metric $\rho$ is taken to be the sup-norm.

Start the proof of Theorem 3.2.3 by defining $C_\rho$ to be

$C_\rho = \{g : C[0,1] \to \mathbb{R} : g$ is $\rho$-uniformly continuous, bounded and measurable$\}$,

and choose g to be any function in $C_\rho$. Define

$\delta_g(\epsilon) = \sup_{\rho(s,t) < \epsilon}|g(t) - g(s)|$,

$C = \sup_t |g(t)|$,

$A_l(g; s) = g(s)$ for $s = 0/l, 1/l, \ldots, (l - 1)/l, l/l$,
$A_l(g; s) = l\,[(s - [i - 1]/l)\,g(i/l) + (i/l - s)\,g([i - 1]/l)]$ for $[i - 1]/l < s < i/l$.

It is easily seen that $A_l(g; s)$ is piecewise linear with nodes at $(i/l, g(i/l))$.
To show that the expected value converges, the quantity

$|E[g(KDP_{N,h})] - E[g(KDP_h)]|$

will be shown to be bounded by terms decreasing to zero. Start by applying the
triangle inequality:

$|E[g(KDP_{N,h})] - E[g(KDP_h)]|$

$\le |E[g(KDP_{N,h})] - E[g(A_l(KDP_{N,h}))]|$

$+ |E[g(A_l(KDP_{N,h}))] - E[g(A_l(KDP_h))]|$

$+ |E[g(A_l(KDP_h))] - E[g(KDP_h)]|$

(B.9)  $\le \delta_g(\epsilon) + 2C\,P[\rho(KDP_{N,h}, A_l(KDP_{N,h})) \ge \epsilon]$

$+ |E[g(A_l(KDP_{N,h}))] - E[g(A_l(KDP_h))]|$

$+ \delta_g(\epsilon) + 2C\,P[\rho(KDP_h, A_l(KDP_h)) \ge \epsilon]$.

The first two terms of (B.9) result by splitting the sample space into two regions:
one where

$\rho(KDP_{N,h}, A_l(KDP_{N,h})) < \epsilon$

and one where

$\rho(KDP_{N,h}, A_l(KDP_{N,h})) \ge \epsilon$.

On the first region, $|g(KDP_{N,h}) - g(A_l(KDP_{N,h}))| \le \delta_g(\epsilon)$ and so the expected
values differ by less than $\delta_g(\epsilon)$. On the second region,

$|g(KDP_{N,h}) - g(A_l(KDP_{N,h}))| \le 2C$,

and so the second term results. The last two terms of (B.9) are derived in an
identical fashion.
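
The interpolation operator $A_l$ used in this argument is ordinary piecewise-linear interpolation on the grid 0/l, 1/l, ..., l/l, and its sup-norm distance from a continuous path shrinks as l grows. A minimal sketch follows; the function names and the test path s squared are assumptions for illustration, and the sup-norm is approximated on a fine evaluation grid.

```python
import numpy as np

def interpolate_on_grid(path, n_nodes, s):
    """Piecewise-linear interpolant of `path` with nodes at i/n_nodes."""
    nodes = np.arange(n_nodes + 1) / n_nodes
    return np.interp(s, nodes, path(nodes))

def sup_distance(path, n_nodes, n_eval=2001):
    """Approximate sup-norm distance between a path and its interpolant."""
    s = np.linspace(0.0, 1.0, n_eval)
    return float(np.max(np.abs(path(s) - interpolate_on_grid(path, n_nodes, s))))

def square(s):
    """Hypothetical continuous path used only for this example."""
    return np.asarray(s, dtype=float) ** 2

value = float(interpolate_on_grid(square, 2, 0.25))
coarse = sup_distance(square, 2)
fine = sup_distance(square, 20)
```

The interpolant agrees with the path at the nodes, and refining the grid from l = 2 to l = 20 reduces the sup-norm distance, which is the behavior the probability bounds above rely on.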


First handle the term $P[\rho(KDP_{N,h}, A_l(KDP_{N,h})) \ge \epsilon]$. By Lemma B.1, one has

$\sup_{|w - v| < \delta}|KDP_{N,h}(w) - KDP_{N,h}(v)| \le \frac{1}{h^2}\sup_{0 \le t \le 1}|CD_N(t)|\,\Theta(\delta)$.

By the construction of $A_l$, it can be seen that

$\rho(KDP_{N,h}, A_l(KDP_{N,h})) \le \sup_{|w - v| < 1/l}|KDP_{N,h}(w) - KDP_{N,h}(v)|$.

These two facts lead to:

$P[\rho(KDP_{N,h}, A_l(KDP_{N,h})) \ge \epsilon]$

$\le P\left[\frac{1}{h^2}\sup_{0 \le t \le 1}|CD_N(t)|\,\Theta(1/l) \ge \epsilon\right]$

$= P\left[\sup_{0 \le t \le 1}|CD_N(t)| \ge h^2\epsilon/\Theta(1/l)\right]$

$\to P\left[\sup_{0 \le t \le 1}|L(t)| \ge h^2\epsilon/\Theta(1/l)\right]$ as $m \wedge n \to \infty$

$\to 0$ as $l \to \infty$.

The convergence of the probability as $m \wedge n \to \infty$ is a consequence of

$\sup_{0 \le t \le 1}|CD_N(t)| \xrightarrow{d} \sup_{0 \le t \le 1}|L(t)|$.

This convergence in distribution is a result of the weak convergence of the process
$CD_N$ to the process L. One need only apply Theorem 3.11 of Ruymgaart (1988)
to obtain the convergence in distribution result.

Examine  now  the  last  term  of  (B.9).  A  consequence  of  Lemma  (B.l)  is  that 
KDPfc  is  uniformly  continuous  with  probability  1.  This  result  implies  that 
p(KDP/,,  A/(KDP^))  — ►  0  as  /  -*  oo  with  probability  1.  Hence  one  has  the 
result 

P(p(KDP/n  Aj(KDPh))  >  tj  -  0  as  /  -  oo. 

All  that  remains  to  be  shown  is  that 

|e[j/(^((KDP jv,fe))]  -  E[»(J4,(KDPfc))]|  — *  0  as  m  A  n  — .  oo. 

It  can  be  shown  quite  easily  that  KDPjv,/»(w)  KDPfc(u;)  for  each  w  €  (0, 1]. 

This  is  shown  by  appeal  to  Theorem  3.11  of  Ruymgaart  (1988).  It  is  similarly 

shown  that  for  0  <  w\  <  v>2  <  .  •  •  <  tu*  <  1  and  (&i, . . . ,  6*)  6  IR*,  one  has 

k  k 

^tiKDP*, /,(«>,)  ^6,KDPk(u,,). 

t=l  »=1 

This  result  follows  from  the  above  convergence  in  distribution  and  the  fact  that 
integrals  are  linear  operators.  By  the  Cram4r-Wold  device,  one  may  conclude 
that 

(B.10)  (RDP^twi) . KDPW, *(«*))  -i  (KDP*(«;, KDP*(wt)) 

as  m  A  n  — »  oo.  Convergence  in  distribution  implies  that 

EIMKDPj^to), ....  KDP*. *(«*))]  -  E(A(KDP^(u»|) . KDP*(»t))] 

as $m \wedge n \to \infty$ for any bounded and continuous function $h : \mathbb{R}^k \to \mathbb{R}$. Define

$$\phi(x_0,\ldots,x_l) = g(A_l(x)),$$

where $x(s) \in C[0,1]$ and $x_k = x(k/l)$. The function $\phi$ maps $\mathbb{R}^{l+1}$ into $\mathbb{R}$ since the function $A_l$ depends on only $l+1$ values of the function $x(s)$. Since $g$ is bounded and continuous in the sup-norm, $\phi$ is bounded and continuous in the Euclidean norm. Combining (B.10) with these results concerning $\phi$ yields

$$
\begin{aligned}
E[g(A_l(\mathrm{KDP}_{N,h}))] &= E[\phi(\mathrm{KDP}_{N,h}(0/l), \mathrm{KDP}_{N,h}(1/l), \ldots, \mathrm{KDP}_{N,h}(l/l))] \\
&\to E[\phi(\mathrm{KDP}_h(0/l), \mathrm{KDP}_h(1/l), \ldots, \mathrm{KDP}_h(l/l))] \\
&= E[g(A_l(\mathrm{KDP}_h))].
\end{aligned}
$$

The convergence occurs as $m \wedge n \to \infty$ and holds for all $l \ge 1$. Hence

$$\big|E[g(A_l(\mathrm{KDP}_{N,h}))] - E[g(A_l(\mathrm{KDP}_h))]\big| \to 0$$

as $m \wedge n \to \infty$ for all $l \ge 1$.

Combining  all  these  results  one  has 


,“■5,  JSU.|*WKOPJrJJ|  -  E[9(KDPk)l|  =  2 6,(e)  -  0 
as $\epsilon \to 0$. Thus $\mathrm{KDP}_{N,h} \Rightarrow \mathrm{KDP}_h$ in $(C[0,1], \mathcal{C}_\rho, \rho)$ as $m \wedge n \to \infty$.


Proof  of  Lemma  3.2.3 

The covariance kernel, $C_h(w,v)$, of $\sqrt{\lambda_0(1-\lambda_0)}\,\mathrm{KDP}_h(w)$ is given by

$$C_h(w,v) = h^{-4}\int_0^1 K_s'([w-s]/h)\int_0^1 K_{s'}'([v-t]/h)\,[\min(s,t) - st]\,dt\,ds.$$

See Kannan (1979), page 154, for a proof of this result. Letting

$$f'(s) = \frac{1}{h^2}K_s'([w-s]/h), \tag{B.11}$$

$$g'(t) = \frac{1}{h^2}K_{s'}'([v-t]/h), \tag{B.12}$$

an equivalent expression for $C_h(w,v)$ is

$$
\begin{aligned}
C_h(w,v) &= \int_0^1 f'(s)\int_0^1 g'(t)\,[\min(s,t) - st]\,dt\,ds \\
&= \int_0^1\!\!\int_0^1 f'(s)g'(t)\min(s,t)\,ds\,dt - \int_0^1 s f'(s)\,ds \cdot \int_0^1 t g'(t)\,dt.
\end{aligned}
\tag{B.13}
$$

The first term of (B.13) is

$$
\begin{aligned}
\int_0^1\!\!\int_0^1 f'(s)g'(t)\min(s,t)\,ds\,dt
&= \int_0^1 f'(s)\int_0^s t g'(t)\,dt\,ds + \int_0^1\!\!\int_s^1 s f'(s) g'(t)\,dt\,ds \\
&= \int_0^1 f'(s)\Big[t g(t)\Big|_0^s - \int_0^s g(t)\,dt\Big]ds + \int_0^1 s f'(s)[g(1) - g(s)]\,ds \\
&= \int_0^1 s f'(s) g(s)\,ds - \int_0^1 f'(s)\int_0^s g(t)\,dt\,ds \\
&\qquad + g(1)\int_0^1 s f'(s)\,ds - \int_0^1 s g(s) f'(s)\,ds \\
&= -\int_0^1\!\!\int_0^s f'(s) g(t)\,dt\,ds + g(1)f(1) - g(1)\int_0^1 f(s)\,ds \\
&= \int_0^1 g(s)f(s)\,ds - f(1)\int_0^1 g(t)\,dt + g(1)f(1) - g(1)\int_0^1 f(s)\,ds.
\end{aligned}
$$

The second term of (B.13) becomes:

$$\int_0^1 s f'(s)\,ds \cdot \int_0^1 t g'(t)\,dt = \Big[f(1) - \int_0^1 f(s)\,ds\Big]\Big[g(1) - \int_0^1 g(t)\,dt\Big].$$

There is much cancellation when these are combined to find $C_h(w,v)$. The result is

$$C_h(w,v) = \int_0^1 g(s)f(s)\,ds - \int_0^1 f(s)\,ds \cdot \int_0^1 g(t)\,dt.$$

Defining $f'(s)$ and $g'(t)$ as in (B.11) and (B.12) gives

$$f(s) = -\frac{1}{h}K_s([w-s]/h),$$

$$g(t) = -\frac{1}{h}K_{s'}([v-t]/h).$$

The final form for $C_h(w,v)$ is then

$$C_h(w,v) = \frac{1}{h^2}\int_0^1 K_s([w-s]/h)\,K_{s'}([v-s]/h)\,ds - 1.$$

The '$-1$' arises since the kernel integrates to 1.
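As a numerical sanity check of the final form, the sketch below evaluates $C_h(w,v)$ by a midpoint rule for interior points only, that is, assuming $h < w, v < 1-h$ so that both boundary kernels reduce to the ordinary biweight kernel. The function names, the grid size, and the restriction to interior points are illustrative assumptions, not part of the original derivation. For interior $w = v$ the value should be close to $5/(7h) - 1$, because the squared biweight kernel integrates to $5/7$.

```python
def biweight(u):
    """Biweight kernel K(u) = (15/16)(1 - u^2)^2 on [-1, 1], zero elsewhere."""
    return (15.0 / 16.0) * (1.0 - u * u) ** 2 if abs(u) <= 1.0 else 0.0


def cov_interior(w, v, h, steps=20000):
    """Midpoint-rule value of (1/h^2) * int_0^1 K((w-s)/h) K((v-s)/h) ds - 1.

    Assumption: h < w, v < 1 - h, so no boundary correction is needed.
    """
    ds = 1.0 / steps
    total = 0.0
    for i in range(steps):
        s = (i + 0.5) * ds
        total += biweight((w - s) / h) * biweight((v - s) / h)
    return total * ds / h ** 2 - 1.0
```

When $|w - v| \ge 2h$ the two kernels do not overlap and the sketch returns exactly $-1$, which matches the zero-integral cases listed later in this proof.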


It is possible to derive a closed form representation for $C_h(w,v)$. This is very desirable as it is necessary to evaluate $C_h(w,v)$ many times while approximating the eigenvalues and eigenfunctions. The process of finding this representation starts by evaluating the integral

$$\int_p^q K_s(u)\,K_{s'}(u + [v-w]/h)\,du,$$

which results, by a change of variable $u = [w-s]/h$, in the integral

$$\frac{1}{h}\int_x^y K_s([w-s]/h)\,K_{s'}([v-s]/h)\,ds, \tag{B.14}$$

where $x$ and $y$ will be determined shortly. Writing out the formula for the boundary kernel, one can actually perform the integration in closed form:

$$
\begin{aligned}
&\int_p^q K_s(u)\,K_{s'}([v-w]/h + u)\,du \\
&\quad = \int_p^q (\theta_s + \phi_s u)\,[\theta_{s'} + \phi_{s'}(u + [v-w]/h)]\,K(u)\,K(u + [v-w]/h)\,du.
\end{aligned}
\tag{B.15}
$$

The kernel $K(u)$ is taken to be the biweight kernel, $K(u) = \alpha(1-u^2)^2$ with $\alpha = 15/16$. Substituting this form of $K$ into equation (B.15) and after much simplification, one arrives at the solution

$$\alpha^2\sum_{i=0}^{10}\frac{b_i}{i+1}\left(q^{i+1} - p^{i+1}\right),$$


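where

$$
\begin{aligned}
b_0 &= a_0 d, \\
b_1 &= a_1 d + a_0 e, \\
b_2 &= a_2 d + a_1 e + a_0 f, \\
b_3 &= a_3 d + a_2 e + a_1 f, \\
b_4 &= a_4 d + a_3 e + a_2 f, \\
b_5 &= a_5 d + a_4 e + a_3 f, \\
b_6 &= a_6 d + a_5 e + a_4 f, \\
b_7 &= a_7 d + a_6 e + a_5 f, \\
b_8 &= a_8 d + a_7 e + a_6 f, \\
b_9 &= a_8 e + a_7 f, \\
b_{10} &= a_8 f,
\end{aligned}
$$

$$
\begin{aligned}
a_0 &= 1 - 2c^2 + c^4, \\
a_1 &= 4c(c^2 - 1), \\
a_2 &= -2(c^4 - 5c^2 + 2), \\
a_3 &= -4c(2c^2 - 3), \\
a_4 &= 6 - 14c^2 + c^4, \\
a_5 &= 4c(c^2 - 3), \\
a_6 &= 2(3c^2 - 2), \\
a_7 &= 4c, \\
a_8 &= 1,
\end{aligned}
$$

$$
\begin{aligned}
c &= (v - w)/h, \\
d &= \theta_s\theta_{s'} + \theta_s\phi_{s'}c, \\
e &= \theta_s\phi_{s'} + \theta_{s'}\phi_s + \phi_s\phi_{s'}c, \\
f &= \phi_s\phi_{s'}.
\end{aligned}
$$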
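The coefficient lists can be checked mechanically. The Python sketch below builds $a_0,\ldots,a_8$ and $b_0,\ldots,b_{10}$ from $c$ and the boundary-kernel constants, integrates the resulting degree-10 polynomial term by term, and allows comparison against a direct midpoint-rule evaluation of the integrand in (B.15). The closed form used here, $\alpha^2\sum_i b_i(q^{i+1}-p^{i+1})/(i+1)$, is what term-by-term integration gives and is an assumption about the printed solution. The argument names `th_s`, `ph_s`, `th_t`, `ph_t` stand for $\theta_s, \phi_s, \theta_{s'}, \phi_{s'}$, and any numerical values passed for them are hypothetical.

```python
ALPHA = 15.0 / 16.0  # biweight normalising constant


def a_coeffs(c):
    """Coefficients of (1 - u^2)^2 (1 - (u + c)^2)^2 in powers of u."""
    return [
        1 - 2 * c**2 + c**4,
        4 * c * (c**2 - 1),
        -2 * (c**4 - 5 * c**2 + 2),
        -4 * c * (2 * c**2 - 3),
        6 - 14 * c**2 + c**4,
        4 * c * (c**2 - 3),
        2 * (3 * c**2 - 2),
        4 * c,
        1.0,
    ]


def b_coeffs(c, th_s, ph_s, th_t, ph_t):
    """b_i = a_i d + a_{i-1} e + a_{i-2} f, i = 0..10."""
    d = th_s * th_t + th_s * ph_t * c
    e = th_s * ph_t + th_t * ph_s + ph_s * ph_t * c
    f = ph_s * ph_t
    b = [0.0] * 11
    for i, ai in enumerate(a_coeffs(c)):
        b[i] += ai * d
        b[i + 1] += ai * e
        b[i + 2] += ai * f
    return b


def closed_form(p, q, c, th_s, ph_s, th_t, ph_t):
    """Term-by-term integral of the polynomial integrand over [p, q]."""
    b = b_coeffs(c, th_s, ph_s, th_t, ph_t)
    return ALPHA**2 * sum(
        bi * (q ** (i + 1) - p ** (i + 1)) / (i + 1) for i, bi in enumerate(b)
    )


def midpoint(p, q, c, th_s, ph_s, th_t, ph_t, steps=20000):
    """Direct midpoint-rule evaluation of the integrand in (B.15)."""
    du = (q - p) / steps
    total = 0.0
    for i in range(steps):
        u = p + (i + 0.5) * du
        lin = (th_s + ph_s * u) * (th_t + ph_t * (u + c))
        total += lin * ALPHA**2 * (1 - u * u) ** 2 * (1 - (u + c) ** 2) ** 2
    return total * du
```

The polynomial form is only valid where both kernels are inside their supports, which is why the limits $p$ and $q$ have to be chosen with care.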


The relation of $x$ and $y$ of equation (B.14) and $p$ and $q$ of equation (B.15) is

$$p = \frac{w - y}{h}, \qquad q = \frac{w - x}{h}.$$


The limits $x$ and $y$ need to be determined so that

$$K_s([w-u]/h)\,K_{s'}([v-u]/h)$$

is non-zero a.e. over this range so that the formulas for the boundary kernel employed to arrive at (B.15) are valid. Recall that the support of the left-hand boundary kernel is $[-1, s]$ and the right-hand boundary kernel is $[-s, 1]$. The polynomial formulas for the integrand do not vanish outside these intervals, so it is very important to limit the range of integration. The limits are given for the various cases below. Assume without loss of generality that $w < v$.

Case 1: $0 < w < h$ and $0 < v < h$. Take $x = 0$ and $y = w + h$ which gives $p = -1$ and $q = w/h$.

Case 2: $0 < w < h$ and $h < v < 1 - h$.

Case 2a: $w + h < v - h$ means the integral is zero.

Case 2b: $w + h > v - h$. Take $x = v - h$ and $y = w + h$ which gives $p = -1$ and $q = 1 - c$.

Case 3: $0 < w < h$ and $1 - h < v < 1$.

Case 3a: $w + h < v - h$ implies the integral is zero.

Case 3b: $w + h > v - h$. Take $x = v - h$ and $y = w + h$ which gives $p = -1$ and $q = 1 - c$.

Case 4: $h < w < 1 - h$ and $h < v < 1 - h$.

Case 4a: $w + h < v - h$ implies the integral is zero.

Case 4b: $w + h > v - h$. Take $x = v - h$ and $y = w + h$ which gives $p = -1$ and $q = 1 - c$.

Case 5: $h < w < 1 - h$ and $1 - h < v < 1$.

Case 5a: $w + h < v - h$ implies the integral is zero.

Case 5b: $w + h > v - h$. Take $x = v - h$ and $y = w + h$ which gives $p = -1$ and $q = 1 - c$.

Case 6: $1 - h < w < 1$ and $1 - h < v < 1$. Take $x = v - h$ and $y = 1$ which gives $p = (w - 1)/h$ and $q = 1 - c$.
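The six cases can be condensed into a single rule, sketched below in Python. The condensation $x = \max(0, v-h)$, $y = \min(1, w+h)$ is an observation made here rather than a formula stated in the text, and it assumes $0 < h < 1/2$ and uses $p = (w-y)/h$, $q = (w-x)/h$. The function name `integration_limits` is hypothetical.

```python
def integration_limits(w, v, h):
    """Return (p, q) for the integral in (B.15), or None when it is zero.

    Assumes 0 < h < 1/2 and w, v in [0, 1]. The arguments are swapped if
    needed so that w <= v; the caller must then swap the boundary-kernel
    indices s and s' in the same way.
    """
    if w > v:
        w, v = v, w
    x = max(0.0, v - h)   # lower limit of the s-integral in (B.14)
    y = min(1.0, w + h)   # upper limit of the s-integral in (B.14)
    if x >= y:            # kernels do not overlap: Cases 2a, 3a, 4a, 5a
        return None
    return (w - y) / h, (w - x) / h
```

For example, $w = 0.95$, $v = 0.97$, $h = 0.1$ falls under Case 6 and gives $p = (w-1)/h = -0.5$ and $q = 1 - c = 0.8$.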


After implementing these formulas, one finds $C_h(w,v)$ by rewriting it as

$$C_h(w,v) = \frac{1}{h^2}\int_x^y K_s([w-u]/h)\,K_{s'}([v-u]/h)\,du - 1.$$


Proof  of  Lemma  3.3.1 

To prove that $C_h(w,v)$ is continuous on the unit square, it must be shown that for all $\epsilon > 0$ there exists a $\delta > 0$ such that

$$|C_h(w_1,v_1) - C_h(w,v)| < \epsilon, \tag{B.16}$$

if $\|(w_1,v_1) - (w,v)\| < \delta$ where $\|\cdot\|$ is the Euclidean norm. Equation (B.16) can be rewritten as

$$
\begin{aligned}
|C_h(w_1,v_1) - C_h(w,v)| &= |C_h(w_1,v_1) - C_h(w_1,v) + C_h(w_1,v) - C_h(w,v)| \\
&\le \sup_{s,t}\Big|\frac{1}{h^2}K_s(t)\Big| \int_0^1 \big|K([v_1-u]/h) - K([v-u]/h)\big|\,du \\
&\quad + \sup_{s,t}\Big|\frac{1}{h^2}K_s(t)\Big| \int_0^1 \big|K([w_1-u]/h) - K([w-u]/h)\big|\,du.
\end{aligned}
\tag{B.17}
$$


With a proof completely analogous to that of Lemma 3.2.1, one can show that

$$\sup_{|w_1 - w| < \delta}\int_0^1 \big|K([w_1-u]/h) - K([w-u]/h)\big|\,du \le \theta(\delta),$$

where $\theta(\delta) \to 0$ as $\delta \to 0$. This implies that (B.17) can be made smaller than $\epsilon$ if $\delta$ is taken sufficiently small and

$$|w_1 - w| < \delta, \qquad |v_1 - v| < \delta.$$

Note that $\|(w_1,v_1) - (w,v)\| < \delta$ implies these last two conditions. It must also be shown that

$$\sup_{s,t}|K_s(t)| < \infty,$$

for any of these bounds to be meaningful. By definition,

$$K_s(t) = (\theta_s + \phi_s t)K(t).$$

For the supremum to be infinite, it is clear that either

$$\sup_s|\theta_s| = \infty \quad \text{or} \quad \sup_s|\phi_s| = \infty,$$

since $K(t)$ is bounded and the set over which the supremum is taken with respect to $t$ is also bounded. Without loss of generality, consider the left hand boundary kernel. It is constructed to satisfy


for all $s \in [0,1]$. Hence the supremum of $\theta_s$ and $\phi_s$ must be bounded.

It has now been shown that $C_h(w,v)$ is continuous on the unit square. In fact, $C_h(w,v)$ is uniformly continuous on the unit square. Continuity implies that $C_h(w,v)$ is bounded and integrable. Hence both integrals given in parts (ii) and (iii) of the lemma are finite.

Proof  of  Lemma  3.3.2 

The results of Lemma 3.3.2 will be proved in reverse order. Since $C_h(w,v)$ is continuous on the unit square, the series

$$C_h(w,v) = \sum_{i=1}^{\infty}\lambda_i^h\,\phi_i^h(w)\,\phi_i^h(v)$$

converges absolutely and uniformly by Mercer's theorem [see Shorack and Wellner (1986), page 208]. Proposition (ii) of Lemma 3.3.2 is a direct consequence of Lemma 3.3.1 and Proposition 2 of Shorack and Wellner, page 208. Proposition (i) of Lemma 3.3.2 is that the eigenfunctions, $\phi_j^h(w)$, are continuous on $[0,1]$.

Start with the series representation given by Mercer's theorem. Since this series converges absolutely, it must be that

$$|\phi_j^h(w)\,\phi_j^h(v)| < \infty, \tag{B.18}$$

for all $w, v \in [0,1]$. Since $\phi_j^h$ is an eigenfunction, by assumption $\phi_j^h(w) \not\equiv 0$.

Let $w$ be such that $\phi_j^h(w) \ne 0$. Combining this result with (B.18) implies the existence of $M_j$ such that

$$|\phi_j^h(v)| \le M_j < \infty,$$

for all $v \in [0,1]$.

The defining equation for $\phi_j^h(w)$ and $\lambda_j^h$ leads to

$$
\begin{aligned}
|\phi_j^h(w') - \phi_j^h(w)| &= \frac{1}{\lambda_j^h}\Big|\int_0^1 [C_h(v,w') - C_h(v,w)]\,\phi_j^h(v)\,dv\Big| \\
&\le \frac{1}{\lambda_j^h}\int_0^1 |C_h(v,w') - C_h(v,w)|\,|\phi_j^h(v)|\,dv \\
&\le \frac{\epsilon}{\lambda_j^h}\int_0^1 |\phi_j^h(v)|\,dv, \quad \text{if } |w' - w| < \delta, \\
&\le \frac{\epsilon M_j}{\lambda_j^h}.
\end{aligned}
$$

Thus $\phi_j^h(w)$ is continuous on $[0,1]$. The $\epsilon$ appears because $C_h(w,v)$ is uniformly continuous on the unit square.

Proof  of  Lemma  3.3.3 

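The result of Lemma 3.3.3 was outlined in the text. It remains only to specify some of the details. Choose $(b_1,\ldots,b_M) \in \mathbb{R}^M$ and define

$$
\begin{aligned}
X_N &= \sum_{i=1}^{M} b_i Z_{Ni} \\
&= \int_0^1 \Big[\sum_{i=1}^{M} b_i\,\phi_i^h(w)\Big]\mathrm{KDP}_{N,h}(w)\,dw \\
&= \int_0^1 R(w)\,\mathrm{KDP}_{N,h}(w)\,dw,
\end{aligned}
$$

where

$$R(w) = \sum_{i=1}^{M} b_i\,\phi_i^h(w).$$

Clearly $R(w)$ is continuous since each $\phi_i^h(w)$ is. The functional

$$f(G) = \int_0^1 R(w)\,G(w)\,dw$$

is continuous on $(C[0,1], \rho)$ and measurable $(\mathcal{B}, \mathcal{C}_\rho)$ (see Ruymgaart (1988), pages 40 ff.). By Theorem 3.11 of Ruymgaart (1988), one can conclude that

$$X_N \xrightarrow{d} X$$

as $m \wedge n \to \infty$, where $X$ is given by

$$
\begin{aligned}
X &= \int_0^1 R(w)\,\mathrm{KDP}_h(w)\,dw \\
&= \int_0^1 \Big[\sum_{i=1}^{M} b_i\,\phi_i^h(w)\Big]\mathrm{KDP}_h(w)\,dw \\
&= \sum_{i=1}^{M} b_i \int_0^1 \phi_i^h(w)\,\mathrm{KDP}_h(w)\,dw,
\end{aligned}
$$

since integrals are linear operators. This last equality is clearly $\sum_{i=1}^{M} b_i Z_i$. It has now been shown that for $(b_1,\ldots,b_M) \in \mathbb{R}^M$ one has

$$\sum_{i=1}^{M} b_i Z_{Ni} \xrightarrow{d} \sum_{i=1}^{M} b_i Z_i.$$

It may be concluded by the Cramér-Wold device that

$$(Z_{N1},\ldots,Z_{NM}) \xrightarrow{d} (Z_1,\ldots,Z_M).$$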

Proof  of  Lemma  3.3.4 

Let $g(w)$ be any element in $S_p$. The condition for $C_h(w,v)$ to be positive semi-definite is that there exist a $g \in S_p$ such that

$$\int_0^1\!\!\int_0^1 g(w)\,C_h(w,v)\,g(v)\,dw\,dv = 0.$$

The integral can be rewritten in the following manner:

$$
\begin{aligned}
&\int_0^1\!\!\int_0^1 g(w)\,C_h(w,v)\,g(v)\,dw\,dv \\
&\quad = \int_0^1\!\!\int_0^1 g(w)\Big[\frac{1}{h^2}\int_0^1 K_s([w-u]/h)\,K_{s'}([v-u]/h)\,du - 1\Big]g(v)\,dw\,dv \\
&\quad = \int_0^1\Big[\frac{1}{h}\int_0^1 g(w)K_s([w-u]/h)\,dw\Big]\Big[\frac{1}{h}\int_0^1 g(v)K_{s'}([v-u]/h)\,dv\Big]du - \Big[\int_0^1 g(t)\,dt\Big]^2,
\end{aligned}
\tag{B.19}
$$

where $s = s(w,h)$ and $s' = s(v,h)$. Let

$$r(u) = \frac{1}{h}\int_0^1 g(w)\,K_s([w-u]/h)\,dw,$$

and note that

$$\int_0^1 r(u)\,du = \int_0^1 g(t)\,dt.$$

Substituting $r(u)$ into equation (B.19), one obtains

$$
\begin{aligned}
\int_0^1\!\!\int_0^1 g(w)\,C_h(w,v)\,g(v)\,dw\,dv &= \int_0^1 r(u)^2\,du - \mu^2 \\
&= \int_0^1 [r(u) - \mu]^2\,du,
\end{aligned}
$$

where

$$\mu = \int_0^1 r(u)\,du.$$

If $r(u) = c$ then

$$\int_0^1 [r(u) - \mu]^2\,du = \int_0^1 [c - c]^2\,du = 0.$$

Conversely, if

$$\int_0^1 [r(u) - \mu]^2\,du = 0 \tag{B.20}$$

then

$$r(u) = c. \tag{B.21}$$

This follows since the integrand in (B.20) is non-negative and $r(u)$ is continuous. Putting these results together one finds the following two results.

1. If there exists a $g \in S_p$, $g \not\equiv 0$, such that $r(u) = c$ then (B.20) holds and thus

$$\int_0^1\!\!\int_0^1 g(w)\,C_h(w,v)\,g(v)\,dw\,dv = 0. \tag{B.22}$$

2. If there exists $g \in S_p$, $g \not\equiv 0$, such that (B.22) holds then (B.20) also holds and hence $r(u) = c$.

Thus there exists a $g \in S_p$, $g \not\equiv 0$, such that (B.22) holds if and only if there exists a $g \in S_p$, $g \not\equiv 0$, such that (B.21) holds.

It may be concluded that the condition that there exist a $g \in S_p$ such that

$$\frac{1}{h}\int_0^1 K_s([w-u]/h)\,g(w)\,dw = c, \quad \forall u \in [0,1]$$

is equivalent to the condition for positive semi-definiteness.

Proof  of  Lemma  3.3.5 

The  subset  chi-square  test  depends  on  the  data  only  through  the  components. 
The  components  are  invariant  when  centered  by  the  small  sample  mean.  Hence, 
the  chi-square  test  is  invariant. 

Proof  of  Lemma  3.3.6 

The  invariance  of  the  orthogonal  series  estimator  also  results  from  the  invariance 
of  the  components.  Lemma  3.3.5  states  that  the  set  of  components  selected  by 
the  subset  chi-square  test  will  be  the  same  irrespective  of  which  sample  is  called 
the  first.  This  result  is  due  to  the  invariance  of  the  subset  chi-square  test  applied 
to the components. Let $\hat d_{h,M}$ be the orthogonal series estimate and $U^*_{Ni}$ be the unnormalized components when the population with distribution function $F$ is termed the first sample. Let $\tilde d_{h,M}$ be the orthogonal series estimate and $V^*_{Ni}$ be the unnormalized components when the population with distribution function $G$ is termed the first sample. Let $\lambda_{(N)} = m/N$ be the proportion of the total sample represented by the population with distribution function $F$.

The claim is that

$$\lambda_{(N)}\,\hat d_{h,M}(w) + (1 - \lambda_{(N)})\,\tilde d_{h,M}(w) = 1$$

for all $w \in [0,1]$. Let the normalized components be $U_{Ni}$ and $V_{Ni}$. They are given by

$$U_{Ni} = \sqrt{\frac{N\lambda_{(N)}}{1 - \lambda_{(N)}}}\;U^*_{Ni} = \sqrt{\frac{Nm}{n}}\;U^*_{Ni},$$

and

$$V_{Ni} = \sqrt{\frac{N(1 - \lambda_{(N)})}{\lambda_{(N)}}}\;V^*_{Ni}.$$

The invariance condition for the components is $U_{Ni} = -V_{Ni}$. In terms of $U^*_{Ni}$ and $V^*_{Ni}$ this is

$$mU^*_{Ni} = -nV^*_{Ni}. \tag{B.23}$$


Let  S  represent  the  set  of  components  in  the  models  and  write  out  the  invariance 
condition: 


+  f1  - 


ieS 


n 

+  N 


1  + 


ieS 


=  1  +  Jf  T,(mUk  + 

ieS 

=  1, 


in light of equation (B.23). Thus, the orthogonal series estimator is invariant.

Proof of Theorem 4.2.1

The proof of this theorem will make heavy use of the theorems and techniques of Pyke and Shorack (1968). In fact, the weak convergence will be shown for their process. Since the process $CD_{0N}$ is asymptotically equivalent to their process (see Section 2), it will inherit the result.


It  must  be  shown  that 


IlfcwM  -  MO  -  (i  -  Ao)-*/JA(()|| 

<  \\LK(t)  -  L0(t)\\  +  ||v^[Dfw)(<)  -  1]  -  (1  -  A0)-l/*A(l)||  -E.  o, 

as  m  A  n  — »  oo,  where 

loj»(0  =  VN\FmQHs(t)  -  1], 

LW(()  =  s/N\FmQHN(t)  -  Cfw)(()l, 

#(JV)(X)  =  A(W)f(x)  +  (!  -  A(JV))G(n)(x)> 

*>&)(<)  =  f <?(")('), 

£(t)  is  a  Brownian  bridge  and  ||  •  ||  denotes  the  sup-norm.  By  assumption,  one 
has  the  result 

II v^[/)fw)(()  -(]-(!-  A0)-1/2A(OII  -  0, 


Start  by  giving  an  alternate  representation  of  the  Pyke-Shorack  process,  Lpf(t), 
in  the  Skorohod  space 


i*(f)  =  (1  -  A,Ar,){-7i=Bw(t)£/m|F(jg(i)| 

VAW 

-  -;-----A„(<)Vn[GwQff(t)|}  +  MO. 

V1  “  V) 

where 


= 

0  =  [-^(yy)(ut)  -  iut  ~  0» 

ut  =  H(N)QNit)^ 
^(iV)^Jv(f)  +  (1  “ 

MaKO 1  <  i/^N)’ 

|5jv(OI  <  !/(l  “  *(N))> 

<  !/(A(iV)V^V), 

C(AT)(0  =  G(n)Q(A)(t)> 

Um(t)  =  V^[FmQF(t)  -  i], 

Vn(t)  =  V^lG„Qfn)(0-«]. 

It will be shown that $\sup_{0 \le t \le 1}|L_N(t) - L_0(t)| \xrightarrow{p} 0$ as $m \wedge n \to \infty$. The process $L_0$ is the limiting process of $L_N$ under $H_0$.

The proof divides the interval $[0,1]$ into three subintervals: $[0, 1/N)$, $[1/N, 1 - 1/N]$, and $(1 - 1/N, 1]$. For the first and last intervals, the goal is to show that

$$\sup|L_N(t)| \xrightarrow{p} 0, \qquad \sup|L_0(t)| \xrightarrow{p} 0,$$

as $m \wedge n \to \infty$. Start by examining $L_N(t)$ on the first interval:

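$$\sup_{0 \le t < 1/N}|L_N(t)| \le \sup_{0 \le t < 1/N}\Big[\frac{\sqrt{N}}{m} + \sqrt{N}\,D_{(N)}(t)\Big],$$

since

$$L_N(t) = \sqrt{N}\,[F_m Q^H_N(t) - D_{(N)}(t)],$$

and

$$F_m Q^H_N(t) \le \frac{1}{m} \quad \text{on } 0 \le t < 1/N.$$

Bound $D_{(N)}(t)$ on $[0, 1/N]$ in the following manner:

$$\lambda_{(N)}D_{(N)}(t) + (1 - \lambda_{(N)})\tilde D_{(N)}(t) = t$$

for $0 \le t \le 1$, so

$$\lambda_{(N)}D_{(N)}(t) \le \lambda_{(N)}D_{(N)}(t) + (1 - \lambda_{(N)})\tilde D_{(N)}(t) = t$$

and

$$D_{(N)}(t) \le \frac{t}{\lambda_{(N)}} \le \frac{1}{N\lambda_{(N)}}$$

on $[0, 1/N]$. Hence

$$0 \le \sup_{0 \le t < 1/N}|L_N(t)| \le \frac{\sqrt{N}}{m} + \frac{1}{\sqrt{N}\,\lambda_{(N)}} \to 0.$$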

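Similarly, it can be shown that

$$\sup_{1 - 1/N < t \le 1}|L_N(t)| \to 0.$$

Equation (2.5) of Aly, Csörgő, and Horváth (1987) gives

$$\lim_{\epsilon \to 0+} P\Big[\sup_{0 \le t \le \epsilon}|B(t)| > \delta\Big] = 0,$$

$$\lim_{\epsilon \to 0+} P\Big[\sup_{1 - \epsilon \le t \le 1}|B(t)| > \delta\Big] = 0,$$

for all $\delta > 0$. Let $\epsilon = 1/N$ and choose $\delta > 0$, then

$$
\begin{aligned}
P\Big[\sup_{0 \le t < 1/N}|L_N(t) - L_0(t)| > \delta\Big]
&\le P\Big[\sup_{0 \le t < 1/N}|L_N(t)| + \sup_{0 \le t < 1/N}|L_0(t)| > \delta\Big] \\
&\le P\Big[\sup_{0 \le t < 1/N}|L_N(t)| > \delta/2\Big] + P\Big[\sup_{0 \le t < 1/N}|L_0(t)| > \delta/2\Big] \\
&\to 0 \quad \text{as } m \wedge n \to \infty.
\end{aligned}
$$

The procedure for the interval $(1 - 1/N, 1]$ is perfectly analogous. This leaves the interval $[1/N, 1 - 1/N]$. The limiting process, $L_0(t)$, can be represented by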

Mt)  =  (i  -  . 

where  U0(t)  and  V0(t)  are  independent  Brownian  bridges  and  are  the  limiting 
processes  of  Um(t)  and  Vn(t),  respectively.  Consider  the  following  inequality: 

sup  | LN{t)  -  L0(t) | 

1/A<t<l-1/A 
=  sup|(l  -  A(JV))  {  ~^=BN{i)Um\F <?#(()] 

yV) 

-  -7=^=Aw(()V„[G,„,gjJ(l)|  +%(!)} 

V1-V) 


<  Riff  +  Riff  +  R$ff  +  Riff  +  Rsn, 


where 


Am r  =  sup|[^-=B  - 

R2N  -  sup|l^  [flW(()Um|f  Qj?(!)l  -  </<,(!)]  |, 

R3N  =  sup  |6jv(0li 

R4ff  =  sup|^l-^0~  ^  “  \N) 

Rs/r  =  5UP|  v/r^[-4A'(«)V»|G(„)l?^Wl  -  V.w]  |- 


The suprema are taken over the range $1/N \le t \le 1 - 1/N$. Henceforth, if there is no range indicated on a supremum, it is assumed to be over the interval $[1/N, 1 - 1/N]$. Each of the terms $R_{1N}$ through $R_{5N}$ must be shown to tend to zero in probability as $m \wedge n \to \infty$. Terms $R_{1N}$ and $R_{4N}$ tend to zero in probability because $\lambda_{(N)} \to \lambda_0$, $U_m$ and $V_n$ are each bounded in probability (their suprema actually converge in distribution to proper random variables), and $A_N(t)$ and $B_N(t)$ are bounded (see above). Now examine term $R_{2N}$ in detail:


$$\sup\big|B_N(t)\,U_m[FQ^H_N(t)] - U_0(t)\big| \le S_{1N} + S_{2N} + S_{3N},$$

where

$$
\begin{aligned}
S_{1N} &= \sup|B_N(t)| \cdot \big|U_m[FQ^H_N(t)] - U_0[FQ^H_{(N)}(t)]\big|, \\
S_{2N} &= \sup|B_N(t)| \cdot \big|U_0[FQ^H_{(N)}(t)] - U_0(t)\big|, \\
S_{3N} &= \sup|U_0(t)| \cdot |B_N(t) - 1|.
\end{aligned}
$$

The term $B_N(t)$ is bounded as given above. Theorem 2.2 of Pyke and Shorack states that

$$\sup\big|U_m[FQ^H_N(t)] - U_0[FQ^H_{(N)}(t)]\big| \xrightarrow{p} 0$$

as $m \wedge n \to \infty$ uniformly in all continuous $F$, $G$ and $0 < \lambda_{(N)} < 1$. So it matters not that $G(x) = G_{(n)}(x)$. Thus term $S_{1N}$ converges in probability to zero.

Since $\|FQ^H_{(N)}(t) - t\| \to 0$ (by Polya's theorem) and $U_0$ is uniformly continuous almost surely, the term $S_{2N}$ converges in probability to zero.

The  term  is  all  that  remains.  Rewrite  as 

"  - 1 

The  mean  value  theorem  implies  the  existence,  for  each  t  6  [l/N,  1  -  1  /N]  of 
$  =  spf(t)  between  t  and  such  that 

Bff(t)  =  d^)(s) 

__  _ g(n)(Q^)(a)l _ 

A(iv)/[<3f!vr)(5)l  +  U  -  A(jv))g(n)[Q^v)(s)] 

^  ^  1  ~  X(N)  +  A(AT)/(u)/g(n)(u)’ 

where  u  =  Q^yj(s)  =  and  i®  between  Q^(0  and  Qjy(t).  Define 

the event $E_N$ to be

$$E_N = \{a_N \le Q^H_N(t) \le b_N \text{ for } 1/N \le t \le 1 - 1/N\}$$

and $E_N^c$ to be its complement. By assumption

$$P[E_N] \to 1 \quad \text{as } m \wedge n \to \infty. \tag{B.25}$$


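For $\delta > 0$:

$$
\begin{aligned}
P[\sup|B_N(t) - 1| > \delta] &= P[\sup|B_N(t) - 1| > \delta \mid E_N]\cdot P[E_N] \\
&\quad + P[\sup|B_N(t) - 1| > \delta \mid E_N^c]\cdot P[E_N^c].
\end{aligned}
\tag{B.26}
$$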
Consider $\sup|B_N(t) - 1|$ given that $E_N$ holds. Using (B.24) one can rewrite $|B_N(t) - 1|$ as

$$|B_N(t) - 1| = \frac{\big|\lambda_{(N)}[1 - f(u)/g_{(n)}(u)]\big|}{1 - \lambda_{(N)} + \lambda_{(N)}f(u)/g_{(n)}(u)}.$$

Recall that $u = u_N(t)$ is between $Q^H_{(N)}(t)$ and $Q^H_N(t)$ so that

$$e_N \le u_N(t) \le f_N \quad \text{for } 1/N \le t \le 1 - 1/N,$$

where $e_N$ and $f_N$ are as defined in the statement of Theorem 4.2.1. Hence, one has

$$\sup|B_N(t) - 1| \le \frac{\lambda_{(N)}\sup_{e_N \le x \le f_N}|1 - f(x)/g_{(n)}(x)|}{1 - \lambda_{(N)} + \lambda_{(N)}\inf_{e_N \le x \le f_N} f(x)/g_{(n)}(x)} \to 0$$

as $m \wedge n \to \infty$ since by assumption,

$$\sup_{e_N \le x \le f_N} f(x)/g_{(n)}(x) \to 1,$$

and

$$\inf_{e_N \le x \le f_N} f(x)/g_{(n)}(x) \to 1,$$

as $m \wedge n \to \infty$. Hence, it has been shown that

$$P\Big[\sup_{1/N \le t \le 1 - 1/N}|B_N(t) - 1| > \delta \,\Big|\, E_N\Big] \to 0.$$

Returning to equation (B.26), it is concluded that

$$P\Big[\sup_{1/N \le t \le 1 - 1/N}|B_N(t) - 1| > \delta\Big] \to 0$$

in light of (B.25) and the above result. Since $\sup_{0 \le t \le 1}|U_0(t)|$ is a proper random variable (in fact, its distribution is the limiting distribution of the KS statistic), term $S_{3N}$ tends to zero in probability. Term $R_{5N}$ behaves in an analogous fashion. Hence it has been proved that

$$\big\|L_{0N}(t) - L_0(t) - (1 - \lambda_0)^{-1/2}\Delta(t)\big\| \xrightarrow{p} 0,$$

in the Skorohod construction and so weak convergence results for the original process.

Proof  of  Lemma  4.2.1 

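Lemma 4.2.1 gives the uniform convergence of $\sqrt{n}\,[D_{(N)}(u) - u]$ to $\Delta(u)$ for location and scale alternatives. The complete proof will be given for location alternatives. Changes necessary for scale alternatives will be indicated at the end. Recall, by assumption, that $\lambda_{(N)} = \lambda_0$, hence $D_{(N)}(u) = D_{(n)}(u)$. The $n$ enters because of the local alternative; $m$ does not enter. Start the proof by showing some useful facts:

$$
\begin{aligned}
\sqrt{n}\,[D_{(n)}(u) - u] &= \sqrt{n}\,[FQ^H_{(n)}(u) - u] \\
&= \sqrt{n}\,[FQ^H_{(n)}(u) - H_{(n)}Q^H_{(n)}(u)] \\
&= \sqrt{n}\,(1 - \lambda_0)\big[FQ^H_{(n)}(u) - F[Q^H_{(n)}(u) - \gamma/\sqrt{n}]\big] \\
&= (1 - \lambda_0)\,\gamma f(c),
\end{aligned}
\tag{B.27}
$$

where $c = c_n(u)$ is between $Q^H_{(n)}(u)$ and $Q^H_{(n)}(u) - \gamma/\sqrt{n}$. Next it will be shown that $Q^H_{(n)}(u) \downarrow Q^F(u)$ on $[\delta, 1 - \delta]$ for each $0 < \delta < 1/2$. Start with the identity

$$
\begin{aligned}
u &= H_{(n)}Q^H_{(n)}(u) \\
&= \lambda_0 FQ^H_{(n)}(u) + (1 - \lambda_0)F[Q^H_{(n)}(u) - \gamma/\sqrt{n}].
\end{aligned}
$$

Differentiate this last identity with respect to $n$:

$$0 = \lambda_0 fQ^H_{(n)}(u)\,\frac{d}{dn}Q^H_{(n)}(u) + (1 - \lambda_0)f[Q^H_{(n)}(u) - \gamma/\sqrt{n}]\Big(\frac{d}{dn}Q^H_{(n)}(u) + \gamma/(2n^{1.5})\Big).$$

The resulting formula for $(d/dn)Q^H_{(n)}(u)$ is

$$\frac{d}{dn}Q^H_{(n)}(u) = -\frac{[(1 - \lambda_0)\gamma/(2n^{1.5})]\,f[Q^H_{(n)}(u) - \gamma/\sqrt{n}]}{\lambda_0 fQ^H_{(n)}(u) + (1 - \lambda_0)f[Q^H_{(n)}(u) - \gamma/\sqrt{n}]}.$$

Thus $Q^H_{(n)}(u) \downarrow Q^F(u)$ for each $u \in [\delta, 1 - \delta]$ for $0 < \delta < 1/2$. Since $Q^F(u)$ is continuous on $[\delta, 1 - \delta]$, conclude by Dini's theorem that $Q^H_{(n)}$ converges uniformly to $Q^F(u)$ on $[\delta, 1 - \delta]$.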


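The preliminaries are now taken care of and the proof of the lemma may begin. Choose $\delta$ such that $0 < \delta < 1/2$ and break the interval $[0,1]$ into $[0, \delta]$, $[\delta, 1 - \delta]$, and $[1 - \delta, 1]$. Now write

$$\sup_{0 \le u \le 1}\big|\sqrt{n}\,[D_{(n)}(u) - u] - \Delta(u)\big| \le A_{1n}(\delta) + A_{2n}(\delta) + A_{3n}(\delta) + A_{4n}(\delta) + A_{5n}(\delta),$$

where

$$
\begin{aligned}
A_{1n}(\delta) &= \sup_{0 \le u \le \delta}\big|\sqrt{n}\,[D_{(n)}(u) - u]\big|, \\
A_{2n}(\delta) &= \sup_{0 \le u \le \delta}|\Delta(u)|, \\
A_{3n}(\delta) &= \sup_{\delta \le u \le 1 - \delta}\big|\sqrt{n}\,[D_{(n)}(u) - u] - \Delta(u)\big|, \\
A_{4n}(\delta) &= \sup_{1 - \delta \le u \le 1}\big|\sqrt{n}\,[D_{(n)}(u) - u]\big|, \\
A_{5n}(\delta) &= \sup_{1 - \delta \le u \le 1}|\Delta(u)|.
\end{aligned}
$$

Clearly,

$$\lim_{n \to \infty}\sup_{0 \le u \le 1}\big|\sqrt{n}\,[D_{(n)}(u) - u] - \Delta(u)\big| \le \lim_{n \to \infty}\sum_{i=1}^{5}A_{in}(\delta),$$

for all $\delta > 0$ and so

$$\lim_{n \to \infty}\sup_{0 \le u \le 1}\big|\sqrt{n}\,[D_{(n)}(u) - u] - \Delta(u)\big| \le \lim_{\delta \to 0+}\lim_{n \to \infty}\sum_{i=1}^{5}A_{in}(\delta).$$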
The  strategy  will  be  to  evaluate  the  limits  on  the  right  of  this  last  inequality. 
Start  with  A\n(6): 

sup  k/n[£>(n)(u)  -  u]  =  sup  (1  -  *o)if(c) 


0<u<6' 

from  (B.27),  and  so 


sup  Jv/n[£>(n)(u)  -  «](  <  ^up  (1  -  *ohfQ(i)(u), 


0<u<6 


since  c  =  cN(u)  <  and  fQ^(u)  ^Oasu^O+.  Then 

sup  |%/n[D(nj(tt) -a]  <  sup  (1  -  A  ob/Qm(u), 

0<u<^ 1  '  '  0 <u<6  '  • 

since  Qf  (u)  >  Q^(u)  and  /Q^(u)  ^Oasu^O+.  Thus, 

«!!.“  '/sl£,w(u)  -  “I  |  s  „!>m  (i  -  A»>\j“|/<5(!)(u> 

=  cHin  (l  “  A0 b  SUP  /Qm(“)  =  0, 

5—0+  0<u<£  1 1 

since  limu_o+  fQF{u)  —  0  and  /  is  continuous. 

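For $A_{2n}$ one has

$$\lim_{\delta \to 0+}\lim_{n \to \infty}\sup_{0 \le u \le \delta}|\Delta(u)| = 0,$$

since $\Delta$ is continuous and $\Delta(0) = 0$.

For $A_{3n}$, one has by equation (B.27) that

$$\sup_{\delta \le u \le 1 - \delta}\big|\sqrt{n}\,[D_{(n)}(u) - u] - \Delta(u)\big| = \sup_{\delta \le u \le 1 - \delta}(1 - \lambda_0)\gamma\,\big|f(c) - fQ^F(u)\big|.$$

Expand $f(c)$ about $Q^F(u)$ to arrive at

$$\sup_{\delta \le u \le 1 - \delta}\big|\sqrt{n}\,[D_{(n)}(u) - u] - \Delta(u)\big| = \sup_{\delta \le u \le 1 - \delta}(1 - \lambda_0)\gamma\,\big|f'(d)[c - Q^F(u)]\big|,$$

where $d = d_n(u)$ is between $c = c_n(u)$ and $Q^F(u)$. One can write

$$
\begin{aligned}
|c - Q^F(u)| &\le \max\big(|Q^F(u) - Q^H_{(n)}(u)|,\; |Q^F(u) - Q^H_{(n)}(u) + \gamma/\sqrt{n}|\big) \\
&\le |Q^F(u) - Q^H_{(n)}(u)| + \gamma/\sqrt{n},
\end{aligned}
$$

since $c$ is between $Q^H_{(n)}(u)$ and $Q^H_{(n)}(u) - \gamma/\sqrt{n}$. Substituting this result in the above formula yields

$$
\begin{aligned}
\sup_{\delta \le u \le 1 - \delta}\big|\sqrt{n}\,[D_{(n)}(u) - u] - \Delta(u)\big|
&\le \sup_{\delta \le u \le 1 - \delta}(1 - \lambda_0)\gamma\,|f'(d)| \cdot \big[|Q^F(u) - Q^H_{(n)}(u)| + \gamma/\sqrt{n}\big] \\
&\to 0,
\end{aligned}
$$

as $n \to \infty$ for all $0 < \delta < 1/2$ since $f'$ is bounded and $Q^H_{(n)}$ converges uniformly to $Q^F$ on all intervals of this form. Hence

$$\lim_{\delta \to 0+}\lim_{n \to \infty}\sup_{\delta \le u \le 1 - \delta}\big|\sqrt{n}\,[D_{(n)}(u) - u] - \Delta(u)\big| = \lim_{\delta \to 0+} 0 = 0.$$

The terms $A_{4n}$ and $A_{5n}$ behave the same as $A_{1n}$ and $A_{2n}$, respectively.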


The  procedure  for  scale  alternatives  is  very  similar.  In  this  case,  one  finds  that 

d  H  (l-^oh/[Q^)/(H-T/v^)]Q^j(u)/[2n1-5(l+Vv^)2] 

dnQ(n )  “  A0 fQfa(u)  +  (1  -  A0)/[Q^)(u)/(l  +  i/y/n)  1/(1  +  n/y/n)  ’ 

which  has  at  most  one  sign  change  so  that  the  convergence  is  still  uniform  on 
[6, 1  —  6j.  In  this  case,  the  mean  value  theorem  gives 

>/«[£>(n)(u)  -«]  =  (!“  A0b/(e)Q*) («)/(!  +  l/y/n), 

where  c  =  cn(u)  is  between  and  Q^j(u)/(1  +  7/ y/n ).  The  term 

V*[D{n)(u)  —  u]  —  A(u)  can  be  written  as 

(1  -  A0b[/(c)[Q£)(u)  -Q/'(u)]/(l  +  i/y/n) 

+  QF(u)[/(c)  -  fQF (“)]/(!  +  l/Vn) 

+  fQF(u)QF{u)[l/{l  +  'l/Vn)  -  1]]. 

The intervals $[0, \delta]$ and $[1 - \delta, 1]$ are handled as before. The interval $[\delta, 1 - \delta]$ uses the result just above and the uniform convergence of $Q^H_{(n)}$ to $Q^F$ on $[\delta, 1 - \delta]$. Hence, the convergence of $\sqrt{n}\,[D_{(n)}(u) - u]$ to $\Delta(u)$ is uniform for both location and scale alternatives satisfying the conditions of Lemma 4.2.1.


Proof  of  Theorem  4.2.2 


Define

$$g(t,x) = \cos(tx)\,\mathrm{Re}[\phi(t)] + \sin(tx)\,\mathrm{Im}[\phi(t)],$$

$$f_{N,M}(x) = \frac{M}{\pi N}\sum_{j=1}^{N} g(-M + 2M(j-1)/N,\; x),$$

$$\hat F_{N,M}(x_k) = \frac{\pi}{M}\Big[\sum_{j=1}^{k} f_{N,M}(x_j) + \frac{1}{2}f_{N,M}(x_0) - \frac{1}{2}f_{N,M}(x_k)\Big],$$

for $x_k = \pi k/M$, $k = 0,\ldots,[N/2]$,

$$F_{N,M}(x) = \int_0^x f_{N,M}(t)\,dt.$$

The function $\hat F_{N,M}(b)$ is defined by linear interpolation if $b$ cannot be written in the form $\pi k/M$, for some integer $k$ between 0 and $[N/2]$.

The proof that $\hat F_{N,M}(b) \to F(b)$ will be divided into two parts. First rewrite $\hat F_{N,M}(b) - F(b)$ as

$$\hat F_{N,M}(b) - F(b) = [\hat F_{N,M}(b) - F_{N,M}(b)] + [F_{N,M}(b) - F(b)].$$

Each of the two terms will be shown to tend to zero.

Define $x_b$ to be the nearest $x$ less than or equal to $b$ such that

$$x_b = \frac{\pi k}{M},$$

for some integer $k$, $0 \le k \le [N/2]$. Let $x_a$ be the next greatest $x$ of this form:

$$x_a = \frac{\pi(k+1)}{M}.$$

Clearly $x_a \downarrow b$ and $x_b \uparrow b$ as $M, N \to \infty$. Since $\hat F_{N,M}(x_b)$ is approximating $F_{N,M}(x_b)$ by the trapezoidal rule, one has the bound [see Press, Flannery, Teukolsky, and Vetterling (1986), page 105]

\fn,mM  ~  *jv,a/(6)| 

<  \FN,M(xb)  -  ■fiV>A/(2:6)|  +  FNM(xb)  -  FNiM(b) | 

(B.28)  <  sup  °{f>/M2)  +  \fNiM{b){xb-b)\  +  °{xb-b)’ 

0 <x<b  <*x 

since $k \approx bM/\pi$ for large $M$ and the trapezoidal bound is

$$O(a^3/n^2)\sup_{0 \le x \le a}|f''(x)|$$

when integrating $f(x)$ over the range $[0, a]$ with $n$ grid points. Young's form of Taylor's theorem is used to derive the last two terms of (B.28). The term $f_{N,M}(b)$ can be seen to converge to $f(b)$ by the results below and so

|/n,A/(&)(26  ~b)\~*0 


A 

as  TV,  M  — »  oo  and  M/N  — ►  0.  The  same  result  holds  for  Fff  \f(xa).  The  second 
derivative  with  respect  to  x  of  /jv,A/(z)  can  be  bounded: 


^  \  [77  ^  L|^2lmW] 

y=i  3= 1 


where  tj  =  —M  +  2 (j  —  l)M/N.  The  expression  in  the  brackets  is  converging  to 


Re 


[^(t)]|dt  +  J  |t2Im[^(t)]|dt, 


as  M,N  -*  00,  and  M/N  — *•  0  and  so  is  bounded  since  these  integrals  are  by 
assumption.  One  may  conclude  that 


fN,m{xo)  ~  FNjM{b)  ->  0, 
fn,mM  ~  fn,m  (*)  -*■ 

as  M,  N  — »  00  and  M/N  — *•  0  because  the  differences  are  0(b/M2).  Since 
FNM(b)  is  between  Fs,M^xb)  and  ^V,Af(za)>  it  may  be  concluded  that 

FN,Mib)  ~  FN,\f(b)  °» 


as  N,  M  — ►  00  and  M/N  — ►  0. 

Next  it  will  be  shown  that  Fff  ^(b)  —  F(b)  is  tending  to  zero.  Write 

Fn,mW  -  m  =  [  \!h,m(x)  -  f(z)]dx 

Jo 

=  /  (/n,m(z)  -  /m(z)Wz  +  /  [/m(z)  - 
Jo  Jo 
where

$$f_M(x) = \frac{1}{2\pi}\int_{-M}^{M} g(t,x)\,dt.$$
2jr  J-M 


Examine  these  terms  separately: 
»b 


\Jq  \Jn,m^x)  -  /m(x)]<*x|  <  Jq  | /n,m(x)  ~  fia(x) 


dx 


<b •  sup  \ftr,M(x)  ~  Mx)\ 

0<x<b 


<  sup 
t^o,xe[o,fc] 


h9{t’x)\ 


M3 

N2 


0, 


since $M^3/N^2 \to 0$ and $(d^2/dt^2)g(t,x)$ is bounded by assumption. The point $t = 0$ is not included since it is always the endpoint of a sub-interval for $N$ even. The points at which the derivative of $g(t,x)$ is evaluated are in the interior of these sub-intervals.


Now  handle  the  next  term: 

f  UmW  ~f[x)]dx  =  f  f  g(t,x)dtdx  —  f  f(x)dx 

Jo  2x/0  J-M  Jo 

=  hf  f  9{t,x)dxdt-J  f(x)dx. 
2*  J-M  Jo  Jo 

Billingsley (1986), page 356, proves that

$$\frac{1}{2\pi}\int_{-M}^{M}\!\!\int_0^b g(t,x)\,dx\,dt \to F(b) \quad \text{as } M \to \infty.$$

Thus

$$\int_0^b [f_M(x) - f(x)]\,dx \to 0 \quad \text{as } M \to \infty.$$

Thus the result $\hat F_{N,M}(b) - F(b) \to 0$ has been achieved.
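The two-stage approximation studied in this proof (a Riemann sum in $t$ for the density, then a trapezoidal rule in $x$ for its integral) can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the prefactor $M/(\pi N)$ and the trapezoid end-point weights are reconstructed from the definitions of $f_M$ and $x_k = \pi k/M$, the standard normal characteristic function is used purely as a test case, and the function names are hypothetical. As in the proof, $F(b)$ here denotes $\int_0^b f(x)\,dx$.

```python
import math


def g(t, x, phi):
    """g(t, x) = cos(tx) Re[phi(t)] + sin(tx) Im[phi(t)]."""
    c = phi(t)
    return math.cos(t * x) * c.real + math.sin(t * x) * c.imag


def f_NM(x, phi, N, M):
    """Riemann sum for (1/2pi) * int_{-M}^{M} g(t, x) dt on N grid points."""
    step = 2.0 * M / N
    total = sum(g(-M + step * j, x, phi) for j in range(N))
    return step * total / (2.0 * math.pi)


def F_hat(k, phi, N, M):
    """Trapezoidal rule for int_0^{x_k} f_NM(x) dx on the grid x_j = pi j / M."""
    fs = [f_NM(math.pi * j / M, phi, N, M) for j in range(k + 1)]
    return (math.pi / M) * (sum(fs) - 0.5 * fs[0] - 0.5 * fs[-1])


def phi_normal(t):
    """Characteristic function of the standard normal (test case only)."""
    return complex(math.exp(-0.5 * t * t), 0.0)
```

With $M = 20$ and $N = 512$ the grid spacing in $x$ is $\pi/20$, and `F_hat(6, phi_normal, 512, 20.0)` can be compared with $\Phi(x_6) - 1/2$; the remaining discrepancy is dominated by the trapezoidal error in $x$, in line with the $O(b/M^2)$ term in the proof.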


Proof  of  Lemma  4.2.2 

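First find the moment generating function of $Y = (Z + d)'(Z + d)$, where $Z \sim N_q(0, V)$ and $V = \mathrm{diag}(v_1,\ldots,v_q)$. Let $m_Y(t)$ be the moment generating function. Then

$$
\begin{aligned}
m_Y(t) &= \int_{\mathbb{R}^q}\exp[t(z + d)'(z + d)]\,\frac{1}{(2\pi)^{q/2}|V|^{1/2}}\exp[-z'V^{-1}z/2]\,dz \\
&= \int_{\mathbb{R}^q}\frac{1}{(2\pi)^{q/2}|V|^{1/2}} \\
&\qquad \cdot \exp\big\{-[z - 2(V^{-1} - 2tI)^{-1}td]'(V^{-1} - 2tI)[z - 2(V^{-1} - 2tI)^{-1}td]/2\big\} \\
&\qquad \cdot \exp\big[t\,d'd + 2t^2 d'(V^{-1} - 2tI)^{-1}d\big]\,dz \\
&= \frac{|(V^{-1} - 2tI)^{-1}|^{1/2}}{|V|^{1/2}}\exp\big[t\,d'd + 2t^2 d'(V^{-1} - 2tI)^{-1}d\big] \\
&\qquad \cdot \int_{\mathbb{R}^q}\frac{\exp\{-(z - a)'(V^{-1} - 2tI)(z - a)/2\}}{(2\pi)^{q/2}|(V^{-1} - 2tI)^{-1}|^{1/2}}\,dz, \qquad a = 2(V^{-1} - 2tI)^{-1}td, \\
&= \frac{|(V^{-1} - 2tI)^{-1}|^{1/2}}{|V|^{1/2}}\exp\big[t\,d'd + 2t^2 d'(V^{-1} - 2tI)^{-1}d\big].
\end{aligned}
$$

Let

$$A = V^{-1} - 2tI = \mathrm{diag}(1/v_j - 2t),$$

which implies that

$$\frac{|(V^{-1} - 2tI)^{-1}|^{1/2}}{\big(\prod_j v_j\big)^{1/2}} = \prod_{j=1}^{q}\Big(\frac{1}{1 - 2tv_j}\Big)^{1/2},$$

and

$$t\,d'd + 2t^2 d'(V^{-1} - 2tI)^{-1}d = \sum_{j=1}^{q}\frac{t\,d_j^2}{1 - 2tv_j}.$$

Formally, by substituting $t = it$, one finds the characteristic function to be

$$\phi_Y(t) = \prod_{j=1}^{q}\frac{1}{(1 - 2itv_j)^{1/2}}\exp\Big[\frac{it\,d_j^2}{1 - 2itv_j}\Big].$$

Identifying $v_j = \theta_j$ and $d_j = \sqrt{\theta_j}\,\delta_j$ yields the result.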
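The product form of the moment generating function can be checked numerically. Because $V$ is diagonal, the components are independent and $m_Y(t)$ factors into one-dimensional expectations $E\exp[t(Z_j + d_j)^2]$, each of which the Python sketch below evaluates with a midpoint rule. The function names, the integration range, and the numerical values of $v$, $d$, and $t$ used in any check are illustrative assumptions; the check requires $t < 1/(2\max_j v_j)$ so that the expectation is finite.

```python
import math


def mgf_closed_form(t, v, d):
    """prod_j (1 - 2 t v_j)^(-1/2) * exp(t d_j^2 / (1 - 2 t v_j))."""
    out = 1.0
    for vj, dj in zip(v, d):
        a = 1.0 - 2.0 * t * vj
        out *= a ** -0.5 * math.exp(t * dj * dj / a)
    return out


def mgf_component_numeric(t, vj, dj, half_width=12.0, steps=40000):
    """Midpoint rule for E exp(t (Z + dj)^2) with Z ~ N(0, vj)."""
    sd = math.sqrt(vj)
    lo = -half_width * sd
    dz = 2.0 * half_width * sd / steps
    total = 0.0
    for i in range(steps):
        z = lo + (i + 0.5) * dz
        total += math.exp(t * (z + dj) ** 2 - z * z / (2.0 * vj))
    return total * dz / math.sqrt(2.0 * math.pi * vj)


def mgf_numeric(t, v, d):
    """Product of the one-dimensional numerical expectations."""
    out = 1.0
    for vj, dj in zip(v, d):
        out *= mgf_component_numeric(t, vj, dj)
    return out
```

Replacing the real argument by $it$ in `mgf_closed_form` (with complex arithmetic) would give the characteristic function displayed above; the sketch stays with real $t$ so that the comparison involves only real integrals.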